Searching for Good Weights in a Linear Model

In [ ]:
import numpy as np
import pandas

import matplotlib.pyplot as plt

import matplotlib.animation as animation
plt.rc('animation', html='jshtml')
plt.rc('animation', embed_limit = 1e9)

First, let's find and download some interesting data. The machine learning repository at the University of California, Irvine, is a great resource for publicly available data sets, with explanations aimed at machine learning researchers. Here we download the air quality data set. If curl is not available on your system, you can download the zip file directly from the address used below. It is also useful to visit the repository and find the page describing this data set.

In [ ]:
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/00360/AirQualityUCI.zip
!unzip -o AirQualityUCI.zip

We will use the pandas package to read this data. The pandas.read_csv function is extremely useful for reading in all kinds of data with various peculiarities. Here are the first few lines of AirQualityUCI.csv.

In [ ]:
!head AirQualityUCI.csv

Notice a few things. Fields are separated by semi-colons. The first line gives the name of each variable, one per column. Each row is one sample. Each line ends with two semi-colons. Not immediately obvious is that the decimal values follow the European convention of using a comma instead of a decimal point. Not shown in these first few lines is the fact that missing measurements are given the value -200.

All of these issues can be dealt with directly in the call to pandas.read_csv. I don't mean to imply that I got this right on my first try. The two most puzzling issues were the two semi-colons at the end of each line and the commas for decimal points. The double semi-colons caused the data returned by pandas.read_csv to have more columns than I expected. In the call below, delimiter=';' handles the field separators, decimal=',' handles the commas, usecols=range(15) drops the extra columns created by the trailing semi-colons, and na_values=-200 marks the missing measurements.

Very good pandas tutorials are available.

In [ ]:
data = pandas.read_csv('AirQualityUCI.csv', delimiter=';', decimal=',', usecols=range(15), na_values=-200)
data = data.dropna(axis=0)
data.shape

So, we have 827 rows and 15 columns of data. This means that we read 827 samples that do not have missing values, and each sample contains 15 values. Let's look at the first few rows of this data matrix, called a DataFrame in pandas.

In [ ]:
data.head(10)

Let's create a simple problem for playing with this data. Let's say we want to predict the level of carbon monoxide from the time of day. The column Time contains the hour, but not just the hour; 9 am appears as 09.00.00. Whoopee! This will give us a chance to practice our skills at extracting substrings, converting strings to integers, and doing both steps for all of the Time values within a concise little list comprehension. You don't know what a list comprehension is? Well, it is time to get comfortable with not knowing, and to type 'python list comprehension' into your favorite web search engine.

In [ ]:
data['Time'][:10]
In [ ]:
[t for t in data['Time'][:10]]
In [ ]:
[t[:2] for t in data['Time'][:10]]
In [ ]:
[int(t[:2]) for t in data['Time'][:10]]
In [ ]:
hour = [int(t[:2]) for t in data['Time']]

To get the carbon monoxide measurements for each sample, you can read the data description at the UCI web site to learn that column CO(GT) is the ground truth measurement of carbon monoxide.

In [ ]:
CO = data['CO(GT)']
CO[:10]

Here I will introduce a convention I will follow throughout this class. Inputs to a model are given in a matrix named X. Samples are in rows, and the components, measurements, variables, thingies of each sample are given in the columns. The desired, correct outputs for each sample are given in a matrix named T, for Targets. The $i^{th}$ row of X is Sample $i$ whose correct target output is in row $i$ of T. Let's set this up for our hour to CO problem.

In [ ]:
T = CO
T = np.array(T).reshape((-1, 1))  # make T have one column and as many rows as needed to hold the values of T
Tnames = ['CO']
X = np.array(hour).reshape((-1, 1))
Xnames = ['Hour']
print('X.shape =', X.shape, 'Xnames =', Xnames)
print('T.shape =', T.shape, 'Tnames =', Tnames)

Say after me...plot early and often! We can never have too many visualizations. This next plot verifies that we have defined X and T correctly. What else do you notice?

In [ ]:
plt.plot(X, T, '.')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0]);  # semi-colon here prevents printing the cryptic result of call to plt.ylabel()

Well, what do you think? Will we be able to predict CO from Hour with a linear model? The predictions of a linear model must appear as a straight line in this plot.

What is a linear model? Well, a linear model of one variable is specified with a y-intercept and a slope. These are the two parameters of the linear model. Let's call them w0 and w1. If the output of the linear model is y, then we have y = w0 + x * w1. Let's wrap this up in a little function.

In [ ]:
def linear_model(x, w0, w1):
    return w0 + x * w1

So, what values should we use for w0 and w1 to make a good prediction of CO? What method shall we use to find good values? How about good old guessing, or maybe we could call this trial and error. We will just pick some values and plot the predictions.

Manual Guessing

In [ ]:
w0 = 0
w1 = 1

Y = linear_model(X, w0, w1)

plt.plot(X, T, '.', label='Actual CO')
plt.plot(X, Y, 'r.-', label='Predicted CO')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend();  # make legend using the label strings

Well, clearly our predictions are climbing much too quickly. The slope, w1, is too high. Let's try a smaller value.

In [ ]:
w1 = 0.1
Y = linear_model(X, w0, w1)

plt.plot(X, T, '.', label='Actual CO')
plt.plot(X, Y, 'r.-', label='Predicted CO')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend(); 

Maybe that is too low. Let's try a value in between.

In [ ]:
w1 = 0.3
Y = linear_model(X, w0, w1)

plt.plot(X, T, '.', label='Actual CO')
plt.plot(X, Y, 'r.-', label='Predicted CO')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend(); 

Okay. Now we can try to increase the y-intercept, w0, a bit.

In [ ]:
w0 = 0.4
Y = linear_model(X, w0, w1)

plt.plot(X, T, '.', label='Actual CO')
plt.plot(X, Y, 'r.-', label='Predicted CO')
plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])
plt.legend(); 

We could do this all day. (Well, maybe a few more times.) What we really need is a way to quantify how well our linear model is doing. Let's define a function that calculates the error by calling linear_model to get our predictions, Y, and comparing them to the target values in T. The comparison uses the common root-mean-square error, or RMSE: the differences between T and Y are squared, averaged, and the square root of the average is returned.
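For example, if the targets were T = [1, 2] and the predictions were Y = [1, 4], the squared differences would be 0 and 4, their mean would be 2, and the RMSE would be $\sqrt{2} \approx 1.41$.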

In [ ]:
def error(X, T, w0, w1):
    Y = linear_model(X, w0, w1)
    return np.sqrt(np.mean((T - Y)**2))

Now we are ready to automate the guessing approach we attempted above. Let's define a search algorithm that

  • bumps a weight up and down to determine which direction decreases the error,
  • repeatedly shifts that weight in that direction until the error increases,
  • repeats these steps with the other weight, and
  • repeats the whole process multiple times.

This search algorithm is sometimes called coordinate descent, because it decreases the error by adjusting one coordinate of the weight space, that is, one weight, at a time.

Coordinate Descent

In [ ]:
w0 = 0.4
w1 = 0.5
dw = 0.1   # How much to change a weight's value.

current_error = error(X, T, w0, w1)
up_error = error(X, T, w0 + dw, w1)
down_error = error(X, T, w0 - dw, w1)
if down_error < current_error:
    dw = -dw
    new_error = down_error
else:
    new_error = up_error
while new_error <= current_error:
    current_error = new_error
    w0 = w0 + dw
    new_error = error(X, T, w0, w1)
    print('w0 = {:.2f} new_error = {:.5f}'.format(w0, new_error))
In [ ]:
dw = 0.1
current_error = error(X, T, w0, w1)
up_error = error(X, T, w0, w1 + dw)
down_error = error(X, T, w0, w1 - dw)
if down_error < current_error:
    dw = -dw
    new_error = down_error
else:
    new_error = up_error
while new_error <= current_error:
    current_error = new_error
    w1 = w1 + dw
    new_error = error(X, T, w0, w1)
    print('w1 = {:.2f} new_error = {:.5f}'.format(w1, new_error))

Lots of repeated code here. We don't want to copy and paste for each iteration, so all of the steps are put together in the following function.

In [ ]:
# Need a function for this.  Let's optimize w_bias then w
def coordinate_descent(errorF, X, T, w0, w1, dw, nSteps):
    step = 0
    current_error = errorF(X, T, w0, w1)
    error_sequence = [current_error]
    W_sequence = [[w0, w1]]
    changed = False

    while step < nSteps:

        step += 1
        
        if not changed:
            dw = dw * 0.1
            
        changed = False
        
        # first vary w_bias
        up_error = errorF(X, T, w0 + dw, w1)
        down_error = errorF(X, T, w0 - dw, w1)
        if down_error < current_error:
            dw = -dw
        while True:
            new_w0 = w0 + dw
            new_error = errorF(X, T, new_w0, w1)
            if new_error >= current_error or step > nSteps:
                break
            changed = True
            w0 = new_w0
            W_sequence.append([w0, w1])
            error_sequence.append(new_error)
            current_error = new_error
            step += 1

        # now vary w
        up_error = errorF(X, T, w0, w1 + dw)
        down_error = errorF(X, T, w0, w1 - dw)
        if down_error < current_error:
            dw = -dw
        while True:
            new_w1 = w1 + dw
            new_error = errorF(X, T, w0, new_w1)
            if new_error >= current_error or step > nSteps:
                break
            changed = True
            w1 = new_w1
            W_sequence.append([w0, w1])
            error_sequence.append(new_error)
            current_error = new_error
            step += 1

    return w0, w1, error_sequence, W_sequence

We will need some functions to help us create plots showing the error going down and the sequence of weight values that were tried.

In [ ]:
def plot_sequence(error_sequence, W_sequence, label):
    plt.subplot(1, 2, 1)
    plt.plot(error_sequence, 'o-', label=label)
    plt.xlabel('Steps')
    plt.ylabel('Error')
    plt.legend()
    plt.subplot(1, 2, 2)
    W_sequence = np.array(W_sequence)
    plt.plot(W_sequence[:,0], W_sequence[:,1], '.-', label=label)
    plot_error_surface()

def plot_error_surface():
    n = 20
    w0s = np.linspace(-5, 5, n) 
    w1s = np.linspace(-0.5, 1.0, n) 
    w0s, w1s = np.meshgrid(w0s, w1s)
    surface = []
    for w0i in range(n):
        for w1i in range(n):
            surface.append(error(X, T, w0s[w0i, w1i], w1s[w0i, w1i]))
    plt.contourf(w0s, w1s, np.array(surface).reshape((n, n)), cmap='bone')
    plt.colorbar()
    plt.xlabel('w_bias')
    plt.ylabel('w')
    
def make_animation(model, error_sequence, W_sequence, X, T, label):
    W_sequence = np.array(W_sequence)
    fig = plt.figure(figsize=(15, 8))
    ax1 = plt.subplot(1, 3, 1)
    error_line, = ax1.plot([], [])  #range(len(error_sequence)), [np.nan] * len(error_sequence))
    ax1.set_xlim(0, len(error_sequence))
    ax1.set_ylim(0, max(error_sequence))

    ax2 = plt.subplot(1, 3, 2)
    plot_error_surface()
 
    w_line, = ax2.plot([], [], '.-', label=label)
    ax2.legend()

    ax3 = plt.subplot(1, 3, 3)
    ax3.plot(X, T, 'o')
    model_line, = ax3.plot([], [], 'r-', lw=3, alpha=0.5, label=label)
    ax3.set_xlim(0, 24)
    ax3.set_ylim(np.min(T), np.max(T))

    def animate(i):
        error_line.set_data(range(i), error_sequence[:i])
        w_line.set_data(W_sequence[:i, 0], W_sequence[:i, 1])
        Y = model(X, W_sequence[i, 0], W_sequence[i, 1])
        model_line.set_data(X, Y)
        return error_line, w_line, model_line
    
    return animation.FuncAnimation(fig, animate, frames=len(W_sequence), repeat=True, interval=50, blit=True)
In [ ]:
w0 = 0.5
w1 = 0.5
nSteps = 200
dw = 10
w0, w1, error_sequence, W_sequence = coordinate_descent(error, X, T, w0, w1, dw, nSteps)
print('Coordinate Descent: Error is {:.2f}   W is {:.2f}, {:.2f}'.format(error(X, T, w0, w1), w0, w1))
In [ ]:
plt.figure(figsize=(15,8))
plot_sequence(error_sequence, W_sequence, 'coord desc')
In [ ]:
anim = make_animation(linear_model, error_sequence, W_sequence, X, T, 'coord desc')
plt.close()
In [ ]:
anim

Okay, well that's fun, but this becomes kind of silly when we try to apply this to other models that have more weights, like thousands, or millions. Instead, we need a way to find a direction in which we can change both weights, meaning all weights, on each step.

Run and Twiddle

How about this? Take a step in some direction. If the error decreases, continue in that direction. If the error does not decrease, pick a new random direction. Repeat.

This has been called "run and twiddle", or "run and tumble". The Wikipedia page on run-and-tumble motion describes how single-celled organisms use cilia in their cell membranes to provide locomotion, either moving in the current direction as the cilia beat in a coordinated fashion, or spinning to change direction.
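Before we build the full version below, here is a minimal sketch of the idea, reusing the scalar weights w0 and w1 and the error function defined above. This is only an illustration, not the implementation we will actually use; the variable names and the fixed 200 iterations are arbitrary choices.

# Run and twiddle, sketched with the two scalar weights.
w0, w1 = 0.5, 0.5
step_size = 0.1
best = error(X, T, w0, w1)
rng = np.random.default_rng(0)
for _ in range(200):
    d0, d1 = rng.uniform(-1, 1, 2)                   # pick a random direction ...
    length = np.sqrt(d0**2 + d1**2)
    d0, d1 = step_size * d0 / length, step_size * d1 / length   # ... scaled to a fixed step length
    new_error = error(X, T, w0 + d0, w1 + d1)
    while new_error < best:                          # "run" while the error keeps dropping
        w0, w1, best = w0 + d0, w1 + d1, new_error
        new_error = error(X, T, w0 + d0, w1 + d1)
    # The error went up, so "twiddle": the next iteration picks a new random direction.

The run_and_twiddle function defined later does the same thing with the weights stored in a column matrix, and also shrinks the step size after repeated failures.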

Now we will be changing w_bias and w together, so that we can step through the 2-d weight space in particular directions. Representing our two weights as a two-component vector, and rewriting some functions to accept a vector, simplifies the code a bit. We will actually represent the weights, W, as a column matrix of two components. It will look like

$$W =\begin{bmatrix} w_{bias} \\ w \end{bmatrix}$$

Let's add a couple of functions: one to apply our model to data, and one to calculate the root-mean-square error that we wish to minimize. Then we will modify the plotting functions a bit so that they accept the model function as an argument. We will use a different model towards the end of these notes.

In [ ]:
def linear_model(X, W):
    # W is column vector
    return X @ W[1:, :] + W[0,:]

def rmse(model, X, T, W):
    Y = model(X, W)
    return np.sqrt(np.mean((T - Y)**2))
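As a quick check that the vectorized model matches the earlier scalar version, we can evaluate it with the last weights we guessed by hand, w_bias = 0.4 and w = 0.3. This is just a small illustration; W_check and Y_check are throwaway names.

W_check = np.array([[0.4], [0.3]])   # column matrix: w_bias on top, w below
Y_check = linear_model(X, W_check)   # computes X @ W_check[1:, :] + W_check[0, :], the same as 0.4 + X * 0.3
print(Y_check.shape)                 # (827, 1): one prediction per sample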
In [ ]:
def plot_error_contourf(model):
    n = 20
    wbiass = np.linspace(-5, 5, n)  # history[:,0].min() * 0.5, history[:,0].max() * 2, n)
    ws = np.linspace(-0.5, 1.0, n)  # history[:,1].min() * 0.5, history[:,1].max() * 2, n)
    wbiass, ws = np.meshgrid(wbiass, ws)
    surface = []
    for wbi in range(n):
        for wi in range(n):
            W = np.array([wbiass[wbi, wi], ws[wbi, wi]]).reshape(-1, 1)
            surface.append(rmse(model, X, T, W))
    plt.contourf(wbiass, ws, np.array(surface).reshape((n, n)), cmap='bone')
    plt.colorbar()
    plt.xlabel('w_bias')
    plt.ylabel('w')
    
def plot_sequence(model, error_sequence, W_sequence, label):
    plt.subplot(1, 2, 1)
    plt.plot(error_sequence, 'o-', label=label)
    plt.xlabel('Steps')
    plt.ylabel('Error')
    plt.legend()
    plt.subplot(1, 2, 2)
    W_sequence = np.array(W_sequence)
    plt.plot(W_sequence[:,0], W_sequence[:,1], '.-', label=label)
    plot_error_contourf(model)
    
def make_animation(model, error_sequence, W_sequence, X, T, label):
    W_sequence = np.array(W_sequence)
    fig = plt.figure(figsize=(15,8))
    ax1 = plt.subplot(1, 3, 1)
    error_line, = ax1.plot([], [])  #range(len(error_sequence)), [np.nan] * len(error_sequence))
    ax1.set_xlim(0, len(error_sequence))
    ax1.set_ylim(0, max(error_sequence))

    ax2 = plt.subplot(1, 3, 2)
    plot_error_contourf(model)

    w_line, = ax2.plot([], [], '.-', label=label)
    ax2.legend()


    ax3 = plt.subplot(1, 3, 3)
    ax3.plot(X, T, 'o')
    # Y = linear_model(X, W)
    model_line, = ax3.plot([], [], 'r-', lw=3, alpha=0.5, label=label)
    ax3.set_xlim(0, 24)
    ax3.set_ylim(np.min(T), np.max(T))
    # ax3.legend()

    # def animate_init():
    #     ax1.set_xlim(0, len(W_sequence))
    #     ax1.set_ylim(0, max(error_sequence))

    def animate(i):
        error_line.set_data(range(i), error_sequence[:i])
        w_line.set_data(W_sequence[:i, 0], W_sequence[:i, 1])
        Y = model(X, W_sequence[i:i+1, :].T)  # i+1 keeps 2-d form of W matrix
        model_line.set_data(X, Y)
        return error_line, w_line, model_line

    return animation.FuncAnimation(fig, animate, frames=len(W_sequence), repeat=True, interval=50, blit=True)
In [ ]:
######################################################################

def vector_length(v):
    return np.sqrt(v.T @ v)

def run_and_twiddle(model_f, rmse_f, X, T, W, dW, nSteps):
    step = 0
    current_error = rmse_f(model_f, X, T, W)
    error_sequence = [current_error]
    W_sequence = [W.flatten()]
    nFails = 0
    
    while step < nSteps:
        # print(step)
        new_direction = np.random.uniform(-1, 1, size=(2, 1))
        # print(nFails, new_direction)
        new_direction = dW * new_direction / vector_length(new_direction)
        if nFails > 10:
            dW = dW * 0.8
        while step < nSteps:
            new_W = W.copy() + new_direction
            new_error = rmse_f(model_f, X, T, new_W)
            step += 1
            if new_error >= current_error:
                nFails += 1
                break
            nFails = 0
            # print('good', new_direction)
            W = new_W
            W_sequence.append(W.flatten())
            error_sequence.append(new_error)
            current_error = new_error

    return W, error_sequence, W_sequence
In [ ]:
w_bias = 0.5 # 10
w = 0.5
W = np.array([w_bias, w]).reshape(-1, 1)
nSteps = 400
dW = 10
W, error_sequence, W_sequence = run_and_twiddle(linear_model, rmse, X, T, W, dW, nSteps)
print('Run and Twiddle:  Error is {:.2f}   W is {:.2f}, {:.2f}'.format(rmse(linear_model, X, T, W), W[0,0], W[1,0]))
In [ ]:
plt.figure(figsize=(15,8))
plot_sequence(linear_model, error_sequence, W_sequence, 'run & twiddle')
In [ ]:
anim = make_animation(linear_model, error_sequence, W_sequence, X, T, 'run & twiddle')
plt.close()
In [ ]:
anim

Gradient Descent

Let's call the output of our model Y and the error being minimized E.

To perform gradient descent, we need $\frac{\partial E}{\partial W}$. Let's call this dEdW. The calculation of this can be divided into two factors, using the chain rule.

$$\begin{align*} \frac{\partial E}{\partial W} &= \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial W} \end{align*}$$

The error we want to minimize is the squared error, $(T - Y)^2$, and $Y =X W$, so

$$\begin{align*} \frac{\partial E}{\partial W} &= \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial W} \\ \frac{\partial E}{\partial W} &= \frac{\partial (T-Y)^2}{\partial Y} \frac{\partial X W}{\partial W} \\ \frac{\partial E}{\partial W} &= -2 (T - Y) X \end{align*}$$

In Python, we have

dYdW = X
dEdY = -2 * (T - Y)
dEdW = dEdY.T @ dYdW

with a couple of additional details handled in the functions below: we prepend a column of ones to X so that the bias weight $w_0$ is included in the same matrix product, and we divide by the number of samples so that we are taking the gradient of the mean squared error.
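Concretely, the bias trick is to prepend a column of ones to X, so the bias weight is treated like any other weight. A small illustration (Xb is a throwaway name; np.insert is the same call used in dYdW below):

Xb = np.insert(X, 0, 1, axis=1)   # shape (n_samples, 2): a column of ones, then the hour values
# Xb @ W is exactly X @ W[1:, :] + W[0, :], our linear_model.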

In [ ]:
#def linear_model(X, W):
#    # W is column vector
#    return X @ W[1:, :] + W[0,:]
In [ ]:
def dYdW(X, T, W):
    # One row per sample in X,T.  One column per component of W.
    # The first column is the constant 1, the derivative of Y with respect to the bias weight.
    # The second column is X itself, the derivative of Y with respect to the slope weight.
    return np.insert(X, 0, 1, axis=1)

def dEdY(X, T, W):
    Y = linear_model(X, W)
    return -2 * (T - Y)
    
def dEdW(X, T, W):
    result = dEdY(X, T, W).T @ dYdW(X, T, W) / (X.shape[0])
    return result.T
In [ ]:
def gradient_descent(model_f, gradient_f, rmse_f, X, T, W, rho, nSteps):
    error_sequence = []
    W_sequence = []
    for step in range(nSteps):
        error_sequence.append(rmse_f(model_f, X, T, W))
        W_sequence.append(W.flatten())
        
        W -= rho * gradient_f(X, T, W)  # HERE IS THE ALGORITHM!!
        
    return W, error_sequence, W_sequence
In [ ]:
w_bias = 0.5 # 10
w = 0.5
W = np.array([w_bias, w]).reshape(-1, 1)
nSteps = 200
rho = 0.0002
W, error_sequence, W_sequence = gradient_descent(linear_model, dEdW, rmse, X, T, W, rho, nSteps)
print('Gradient Descent:  Error is {:.2f}   W is {}'.format(rmse(linear_model, X, T, W), W))
In [ ]:
plt.figure(figsize=(15,8))
plot_sequence(linear_model, error_sequence, W_sequence, 'gradient descent')
In [ ]:
anim = make_animation(linear_model, error_sequence, W_sequence, X, T, 'gradient descent')
plt.close()
In [ ]:
anim

Now let's try a recently developed variation of the gradient descent method, called Adam, for adaptive moment estimation. See ADAM: A Method for Stochastic Optimization by Diederik P. Kingma and Jimmy Lei Ba.
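In the notation of that paper, with gradient $g_t$ at step $t$, step size $\alpha$, decay rates $\beta_1$ and $\beta_2$, and a small $\epsilon$ to avoid division by zero, the updates are

$$\begin{align*} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ W_t &= W_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{align*}$$

which is what the function below implements, applying the updates elementwise to each weight.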

In [ ]:
def gradient_descent_adam(model_f, gradient_f, rmse_f, X, T, W, rho, nSteps):
    # Commonly used parameter values
    alpha = rho
    beta1 = 0.9
    beta2 = 0.999
    epsilon = 1e-8
    m = 0
    v = 0
    
    error_sequence = []
    W_sequence = []
    for step in range(nSteps):
        error_sequence.append(rmse_f(model_f, X, T, W))
        W_sequence.append(W.flatten())
        
        g = gradient_f(X, T, W)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        mhat = m / (1 - beta1 ** (step+1))
        vhat = v / (1 - beta2 ** (step+1))
        W -= alpha * mhat / (np.sqrt(vhat) + epsilon)
        
    return W, error_sequence, W_sequence
In [ ]:
w_bias = 0.5 # 10
w = 0.5
W = np.array([w_bias, w]).reshape(-1, 1)
rho = 0.01
nSteps = 200
W, error_sequence, W_sequence = gradient_descent_adam(linear_model, dEdW, rmse, X, T, W, rho, nSteps)
print('Adam:  Error is {:.2f}   W is {:.2f}, {:.2f}'.format(rmse(linear_model, X, T, W), W[0,0], W[1,0]))
In [ ]:
plt.figure(figsize=(15,8))
plot_sequence(linear_model, error_sequence, W_sequence, 'gradient descent adam')
In [ ]:
anim = make_animation(linear_model, error_sequence, W_sequence, X, T, 'gradient descent adam')
plt.close()
In [ ]:
anim

Let's Try a Slightly Nonlinear Model

Let's make a quadratic model now. The equation for the output of this model for the $i^{th}$ sample is

$$ y_i = w_0 + w_1 x_i + w_2 x_i^2$$
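Although this model is nonlinear in the input $x_i$, it is still linear in the weights, which is why the same gradient machinery works. We are really just giving the model an extra input column holding $x_i^2$:

$$ y_i = \begin{bmatrix} 1 & x_i & x_i^2 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$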
In [ ]:
def nonlinear_model(X, W):
    # W is column vector
    linear_part = X @ W[1:2, :] + W[0,:]
    nonlinear_part = X**2 @ W[2:3, :]
    return nonlinear_part + linear_part
In [ ]:
def dYdW(X, T, W):
    # One row per sample in X,T.  One column per component of W.
    # The first column is the constant 1, the second is X, and the third is X**2.
    linear_part = np.insert(X, 0, 1, axis=1)
    nonlinear_part = X**2
    return np.hstack((linear_part, nonlinear_part))  # one column per weight: 1, x, x**2

def dEdY(X, T, W):
    Y = nonlinear_model(X, W)
    return -2 * (T - Y)
    
# dEdW from before does not need to be changed.
In [ ]:
w_bias = 0.5 # 10
w = 0.5
nonlinear_w = 0.1
W = np.array([w_bias, w, nonlinear_w]).reshape(-1, 1)
rho = 0.01
nSteps = 400
W, error_sequence, W_sequence = gradient_descent_adam(nonlinear_model, dEdW, rmse, X, T, W, rho, nSteps)
print('Adam:  Error is {:.2f}   W is {:.2f}, {:.2f}, {:.2f}'.format(rmse(nonlinear_model, X, T, W), W[0,0], W[1,0], W[2,0]))
In [ ]:
plt.figure(figsize=(10,8))
plt.plot(X, T, '.', label='Training Data')

plt.plot(X, nonlinear_model(X, W), 'ro', label='Prediction on Training Data')

plt.xlabel(Xnames[0])
plt.ylabel(Tnames[0])

plt.legend();

How would you change the previous code cell to plot a continuous red line?

In [ ]: