Linear Models for Predicting Continuous Values

Linear Structure

$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

Given $N$ observations, $\xv_n$, for $n=1,\ldots,N$, and target values, $t_n$, for $n=1,\ldots,N$, what is the simplest model, $g(\xv)$, you can think of?

$$ g(\xv) = 0 $$

or maybe

$$ g(\xv) = c $$

What is the next simplest model?

$$ \begin{align*} g(\xv;\wv) &= w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D \\ &= w_0 + \sum_{i=1}^D w_i x_i \\ & = \sum_{i=0}^D w_i x_i \mbox{, where } x_0 = 1 \\ &= \wv^T \xv \end{align*} $$
  • This is nice because it is linear in the parameters $\wv$; optimizations based on derivatives might be solvable analytically.
  • This is not so nice, because it is also linear in the inputs, $\xv$, which greatly limits the complexity of the model.
  • But, a model linear in the inputs might be the best you can do for many cases, such as a sparsely sampled distribution, process, population, thing...whatever it is you want to model.
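For concreteness, here is this model computed with numpy on a tiny made-up data set (the numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

# Made-up example: N = 4 samples, D = 2 inputs, with the constant x_0 = 1 prepended
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 0.5, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 3.0, 0.0]])
w = np.array([0.5, 1.0, -2.0])   # w_0, w_1, w_2, chosen arbitrarily

g = X @ w   # computes w_0 + w_1 x_1 + w_2 x_2 for every sample at once
print(g)
```

One matrix product evaluates the model on all $N$ samples; no loop over samples is needed.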

Fitting Data Samples with a Linear Model

Springs

Force exerted by a spring is proportional to its length. The potential energy stored in the spring is proportional to the square of its length. Say we want the rod to settle at the position that minimizes the potential energy in the springs. $$ \begin{align*} \sum_{n=1}^N (t_n - g(\xv_n;\wv))^2 \end{align*} $$

If $g$ is an affine (linear + constant) function of $x$, $$ g(\xv;\wv) = w_0 + w_1 x $$ with parameters $\wv = (w_0, w_1)$, which parameter values give best fit? $$ \wv_{\mbox{best}} = \argmin{\wv} \sum_{n=1}^N (t_n - g(x_n ; \wv))^2 $$
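On made-up one-dimensional data (a hypothetical noisy line; the solver used here is derived below), the best-fit parameters land close to the values used to generate the data:

```python
import numpy as np

# Synthetic data for illustration: t is roughly 2 + 3 x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 2 + 3 * x + rng.normal(0, 0.1, size=x.shape)

X = np.column_stack((np.ones_like(x), x))   # prepend the constant x_0 = 1 column
w = np.linalg.lstsq(X, t, rcond=None)[0]    # minimizes the sum of squared errors
print(w)   # should be roughly [2, 3]
```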

Set derivative (gradient) with respect to $\wv$ to zero and solve for $\wv$. Let's do this with matrices.

Collect all targets into matrix $T$ and $x$ samples into matrix $X$. ($N$=number samples, $D$=sample dimensionality) $$ \begin{align*} T &= \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{bmatrix} \\ X &= \begin{bmatrix} x_{1,0} & x_{1,1} & x_{1,2} & \dotsc & x_{1,D} \\ x_{2,0} & x_{2,1} & x_{2,2} & \dotsc & x_{2,D} \\ \vdots \\ x_{N,0} & x_{N,1} & x_{N,2} & \dotsc & x_{N,D} \end{bmatrix}\\ \wv &= \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} \end{align*} $$

Collection of all differences is $T - X\wv$, which is an $N \times 1$ matrix. To form the square of all values and add them up, just do a dot product $(T-X\wv)^T (T-X\wv)$. This only works if the value we are predicting is a scalar, which means $T$ is a column matrix. If we want to predict more than one value for each sample, $T$ will have more than one column. Let's continue with the derivation assuming $T$ has $K$ columns, meaning we want a linear model with $K$ outputs.
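A quick numerical check, on arbitrary random matrices, that the dot product form really equals the explicit sum:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 3
X = rng.normal(size=(N, D))
T = rng.normal(size=(N, 1))
w = rng.normal(size=(D, 1))

# Sum of squared differences written as an explicit loop over samples ...
loop_sum = sum((T[n, 0] - X[n, :] @ w[:, 0])**2 for n in range(N))
# ... and as the matrix expression (T - Xw)^T (T - Xw)
matrix_sum = ((T - X @ w).T @ (T - X @ w))[0, 0]
print(loop_sum, matrix_sum)   # the two values agree
```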

To find the best value for $\wv$, we take the derivative of the sum of squared error objective, set it equal to zero, and solve for $\wv$.

$$ \begin{align*} \frac{\partial \sum_{n=1}^N (\tv_n - g(\xv_n;\wv))^2}{\partial \wv} &= -2 \sum_{n=1}^N (\tv_n - g(\xv_n;\wv)) \frac{\partial g(\xv_n;\wv)}{\partial \wv}\\ &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \frac{\partial \xv_n^T \wv}{\partial \wv}\\ &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \xv_n^T \end{align*} $$

Here's where we get the benefit of expressing the $\xv_n$ and $t_n$ samples as matrices. The sum can be performed with a dot product:

$$ \begin{align*} \frac{\partial \sum_{n=1}^N (\tv_n - g(\xv_n;\wv))^2}{\partial \wv} &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \xv_n^T\\ &= -2 \Xv^T (\Tv - \Xv \wv) \end{align*} $$

Check the sizes and shapes of each matrix in the last equation above.
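One way to check: plug in arbitrary sizes (the sizes here are made up) and let numpy complain if anything is inconsistent.

```python
import numpy as np

# Arbitrary sizes: N samples, D + 1 input columns (bias included), K outputs
N, D1, K = 6, 3, 2
X = np.ones((N, D1))
T = np.ones((N, K))
w = np.ones((D1, K))

residual = T - X @ w            # N x K
grad = -2 * X.T @ residual      # (D+1) x K, the same shape as w
print(residual.shape, grad.shape)
```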

Now we can set this equal to zero and solve for $\wv$.

$$ \begin{align*} -2 \Xv^T (\Tv - \Xv \wv) &= 0\\ \Xv^T (\Tv - \Xv \wv) &= 0\\ \Xv^T \Tv &= \Xv^T \Xv \wv\\ \wv &= (\Xv^T \Xv)^{-1} \Xv^T \Tv \end{align*} $$

And, in python

w = np.linalg.inv(X.T @ X) @ X.T @ T

or, you may use the solve function that assumes $\Xv^T \Xv$ is full rank (no linearly dependent columns),

w = np.linalg.solve(X.T @ X, X.T @ T)

or, better yet, use the lstsq function that does not make that assumption.

w = np.linalg.lstsq(X.T @ X, X.T @ T, rcond=None)[0]

The lstsq function can also be called with the simpler arguments

w = np.linalg.lstsq(X, T, rcond=None)[0]

because it is designed to find the value of $\wv$ that minimizes the squared error between $X \wv$ and $T$. (The solve function cannot be called this way; it requires a square first argument.) Using the matrix products as arguments will simplify our implementation of ridge regression a little later in this course, so that is the form we will use below.
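As a sanity check on synthetic data (arbitrary random matrices, purely illustrative), all three formulations should return the same weights, and the gradient derived above should vanish at that solution:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
T = rng.normal(size=(30, 2))   # two target columns, to show the K-output case works too

w1 = np.linalg.inv(X.T @ X) @ X.T @ T
w2 = np.linalg.solve(X.T @ X, X.T @ T)
w3 = np.linalg.lstsq(X, T, rcond=None)[0]
print(np.allclose(w1, w2), np.allclose(w1, w3))   # all three agree

gradient = -2 * X.T @ (T - X @ w3)   # gradient of the squared error at the solution
print(np.abs(gradient).max())        # essentially zero
```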

Application

Try the automobile data set at the UCI Machine Learning Repository. Download auto-mpg.data and auto-mpg.names from the "Data Folder" link near the top of the web page.

In [ ]:
!curl -O http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
!curl -O http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names

First, take a look at auto-mpg.names. There you will learn that there are 398 samples, each with 8 numerical attributes and one string attribute. Their names are

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

First step, read it into python and look at it.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Take a look at a few lines of the data file to figure out how to read it in.

In [ ]:
!head -40 auto-mpg.data

Instead of relying on the pandas package to read this data, let's try using the simpler numpy.loadtxt function. We see that each line contains a sample consisting of 8 numbers and one string. However, first we have to figure out how to deal with the missing values. The documentation for numpy.loadtxt reveals that the converters argument can be used to specify a function to apply to the values from particular columns of data. We can use this to deal with the missing values. Looking at auto-mpg.data we see that the missing values are marked with question marks. These are all in the fourth column, the horsepower attribute.

Here is a simple function that translates '?' to NaN (np.nan) and anything else is converted to a floating point number.

In [ ]:
def missingIsNan(s):
    if s == b'?':  # loadtxt passes byte strings, so compare with b'?', not '?'
        return np.nan
    else:
        return float(s)
    
print(missingIsNan('12.32'))
print(missingIsNan(b'?'))

Hey, why not get fancy and write a one-liner?

In [ ]:
def missingIsNan(s):
    return np.nan if s == b'?' else float(s)

print(missingIsNan('12.32'))
print(missingIsNan(b'?'))

The converters argument to np.loadtxt accepts a python dictionary, which is python's associative array structure.

In [ ]:
d = {1: 'a', 2: 'yo', 3: 42, 4: missingIsNan}
d
In [ ]:
d[1]
In [ ]:
d[4]
In [ ]:
d[4](b'?')

Let's also restrict np.loadtxt to reading just the first 8 columns to avoid dealing with the string in the 9th column.

In [ ]:
data = np.loadtxt('auto-mpg.data', usecols=range(8), converters={3: missingIsNan})
In [ ]:
data.shape
In [ ]:
data[:3,:]

We can change the precision that ipython uses to display floating point values.

In [ ]:
%precision 3
In [ ]:
data[:3,:]

As we have done before, we must find the missing values. Let's just remove the samples with missing values.

In [ ]:
np.isnan(data)
In [ ]:
np.sum(np.isnan(data))
In [ ]:
np.sum(np.isnan(data), axis=0)

What does this result tell us?

Yep, all 6 NaN values are in the fourth column, the horsepower column.

Let's just remove those 6 samples from the array. To do this, build a boolean mask whose length equals the number of rows in data and that is False except for rows that contain a NaN. We can use the np.any or np.all methods to combine the boolean values in each row or column. As always, try such ideas in multiple steps, checking each one.

In [ ]:
nans = np.isnan(data)
nans.shape
In [ ]:
nans.any(axis=0).shape
In [ ]:
nans.any(axis=1).shape

That's the one we want, a boolean value for each row, or sample.

In [ ]:
goodRowsMask = nans.any(axis=1)
goodRowsMask
In [ ]:
dataNew = data[goodRowsMask,:]
dataNew.shape

Wait a minute! This gives us only 6 samples. We wanted all samples but these 6, right?

In [ ]:
dataNew

So, let's change all False values to True and True values to False.

In [ ]:
goodRowsMask = np.logical_not(goodRowsMask)
In [ ]:
dataNew = data[goodRowsMask,:]
dataNew.shape
In [ ]:
dataNew[:3,:]
In [ ]:
np.sum(np.isnan(dataNew),axis=0)

Remember, the next step after reading data into python is to visualize it. One thing we can do is just plot the value of each attribute in a separate graph. Let's make an array of column names to label the y axes.

In [ ]:
names =  ['mpg','cylinders','displacement','horsepower','weight',
          'acceleration','year','origin']
In [ ]:
plt.figure(figsize=(10, 10))
nrow, ncol = dataNew.shape
for c in range(ncol):
    plt.subplot(3, 3, c+1)
    plt.plot(dataNew[:, c])
    plt.ylabel(names[c])

What is interesting to you in these graphs?

Let's try to predict miles per gallon, mpg, from the other attributes. To set this up, let's make a 392 x 1 column vector, T, of target values containing all of the mpg values, and a 392 x 7 matrix, X, to hold the inputs to our model.

In [ ]:
names
In [ ]:
T = dataNew[:, 0:1]  # dataNew[:, 0] results in a one-dimensional array; dataNew[:, 0:1] preserves the two-dimensional nature.
In [ ]:
X = dataNew[:, 1:]
In [ ]:
X.shape, T.shape
In [ ]:
Xnames = names[1:]
Tname = names[0]
Xnames,Tname

Now, let's see if a linear model makes some sense by plotting the target values versus each of the input variables.

In [ ]:
plt.figure(figsize=(10, 10))
for c in range(X.shape[1]):
    plt.subplot(3,3, c+1)
    plt.plot(X[:, c], T, 'o', alpha=0.5)
    plt.ylabel(Tname)
    plt.xlabel(Xnames[c])

What do you think? Are there any linear relationships between the individual input variables and the target variable? Do they make sense, given your knowledge of automobiles?

Now, to build the linear model. First, let's tack on a column of constant 1 values to the left side of X. With this addition, our matrix multiplication takes care of the $w_0$ coefficient as described above. Here are two ways to add the column of 1's. First, we can use np.hstack.

In [ ]:
X1 = np.hstack((np.ones((X.shape[0], 1)), X))
X.shape, X1.shape
In [ ]:
X1[:3, :]

We can also use np.insert whose arguments are np.insert(array, where, value, axis), so to add 1's as a column at index 0, we can do this.

In [ ]:
X1 = np.insert(X, 0, 1, 1)
X.shape, X1.shape

And, let's add a name to Xnames.

In [ ]:
Xnames.insert(0, 'bias')
Xnames

We could try to fit a linear model to all of the data and check to see how accurately we predict mpg for each sample. However, this will give a much too optimistic expectation of how well the model will do on new data.

A much better way to evaluate a model is to remove, or hold out, some of the samples from the data set used to fit the model. Then apply the model to the held out samples and compare the predicted mpg with the actual mpg. Call these held out samples the test set and the other samples used to fit the model the train set.

How should we partition the data into these two disjoint subsets? A common practice is to randomly select 80% of the samples as training samples and use the remaining 20% of the samples as testing samples.

To partition our samples (rows of X and T) into training and test sets, let's deal with just the row indices. Then we will use the same selected row indices to slice out corresponding rows of X and T.

In [ ]:
nrows = X1.shape[0]
nTrain = int(round(nrows * 0.8))
nTest = nrows - nTrain
nTrain, nTest, nTrain+nTest
In [ ]:
rows = np.arange(nrows)
np.random.shuffle(rows)
rows
In [ ]:
trainIndices = rows[:nTrain]
testIndices = rows[nTrain:]
trainIndices, testIndices

Check that the training and testing sets are disjoint.

In [ ]:
np.intersect1d(trainIndices, testIndices)
In [ ]:
Xtrain = X1[trainIndices, :]
Ttrain = T[trainIndices, :]
Xtest = X1[testIndices, :]
Ttest = T[testIndices, :]
Xtrain.shape, Ttrain.shape, Xtest.shape, Ttest.shape

We want to solve for $\wv$ in the equation $X^T X \wv = X^T T$. This is done with the numpy.linalg.lstsq function. Don't be confused by the Ts. Remember something.T is the transpose of something.

In [ ]:
w = np.linalg.lstsq(Xtrain.T @ Xtrain, Xtrain.T @ Ttrain, rcond=None)
w = w[0] # to only keep the weights, and discard other information returned by lstsq
w
In [ ]:
for wi,name in zip(w.flat, Xnames):
    print('{:8.3f}  {:s}'.format(wi, name))

How can you figure out what np.linalg.lstsq is doing? Try finding the source code!

In [ ]:
!locate linalg.py

In my version I see documentation for lstsq that states

Return the least-squares solution to a linear matrix equation.

Solves the equation `a x = b` by computing a vector `x` that
minimizes the Euclidean 2-norm `|| b - a x ||^2`.  The equation may
be under-, well-, or over- determined (i.e., the number of
linearly independent rows of `a` can be less than, equal to, or
greater than its number of linearly independent columns).  If `a`
is square and of full rank, then `x` (but for round-off error) is
the "exact" solution of the equation.

Now that we have a linear model, which is simply $w$, let's use it to predict mpg for the first four training samples.

In [ ]:
predict = Xtrain[:4, :] @ w
predict

How do these predictions compare with the actual mpg values? We can either make a two column matrix, or use a for loop to print them.

In [ ]:
np.hstack(( predict, Ttrain[:4, :]))
In [ ]:
print('{:^5} {:^5}'.format('P', 'T'))
for (p, t) in zip(predict, Ttrain[0:4, :]):
    # print(p,t)
    print('{:5.2f} {:5.2f}'.format(p[0], t[0]))

Let's try all of the test data and plot the results.

In [ ]:
predict = Xtest @ w
predict.shape, Ttest.shape
In [ ]:
plt.plot(predict, Ttest, 'o')
plt.xlabel('Predicted MPG')
plt.ylabel('Actual MPG')
# add a 45 degree line
a = max(min(predict), min(Ttest))
b = min(max(predict), max(Ttest))
plt.plot([a, b], [a ,b], 'r', linewidth=3, alpha=0.7);

Not too shabby! But, how about a numerical measure of accuracy?

Right, just calculate the root-mean-square error (RMSE).

In [ ]:
np.sqrt( np.mean( (Xtest @ w - Ttest)**2))

This means we are about 3 mpg off in our predictions, on average.

Importance of Attributes (Input Variables)

Here again are the weights we learned before for predicting mpg. Can we use these to decide which attributes are most important? Can we remove some from the model?

In [ ]:
for wi, name in zip(w.flat, Xnames):
    print('{:8.3f}  {:s}'.format(wi, name))

Perhaps year and origin are the most significant. Does this make sense?

A weight magnitude is determined not only by the strength of the linear relationship between the corresponding input variable and the target, but also by the variable's range.
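A small made-up demonstration of this effect: rescale a single input by 100, and its fitted weight shrinks by exactly that factor, even though the fit itself is unchanged. (All names and numbers here are hypothetical.)

```python
import numpy as np

rng = np.random.default_rng(5)
xcol = rng.normal(size=(200, 1))
t = 3 * xcol + rng.normal(0, 0.1, size=(200, 1))

Xa = np.hstack((np.ones((200, 1)), xcol))         # original scale
Xb = np.hstack((np.ones((200, 1)), 100 * xcol))   # same variable, 100 times the range
wa = np.linalg.lstsq(Xa, t, rcond=None)[0]
wb = np.linalg.lstsq(Xb, t, rcond=None)[0]
print(wa[1, 0], wb[1, 0])              # second weight is 100 times smaller
print(np.allclose(Xa @ wa, Xb @ wb))   # yet the predictions are identical
```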

What are the ranges of the input variables?

In [ ]:
for n, mn, mx in zip(Xnames, np.min(X1, axis=0), np.max(X1, axis=0)):
    print('{:>20} {:8.2f} {:8.2f}'.format(n, mn, mx))

Remember that you can also call numpy functions as methods applied to numpy array objects.

In [ ]:
for n, mn, mx in zip(Xnames, X1.min(axis=0), X1.max(axis=0)):
    print('{:>20} {:8.2f} {:8.2f}'.format(n, mn, mx))

The weight for weight has the smallest magnitude, but the range of its values is the largest.

The weight for origin has the largest magnitude, but the range of its values is the smallest.

If we could remove the effect of the ranges on the weight magnitudes, we could use the weight magnitudes to judge the relative importance of each input variable. How?

A common approach is to standardize the input variables by adjusting the values to have zero mean and unit variance.

In [ ]:
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xs.shape
In [ ]:
Xs.mean(0)
In [ ]:
Xs.std(0)

To do this correctly when partitioning data into training and testing sets, we must always calculate means and standard deviations using only the training set, and use the same means and standard deviations when standardizing the testing set. Remember, you must not use any information about the testing set when building a model. If you do, your test error will be lower than it will be when you truly see new data.

You can do this by storing the means and standard deviations obtained from the training data, and just use them when standardizing the testing data. Don't forget to add the column of 1's after standardizing.

In [ ]:
means = X.mean(0)
stds = X.std(0)
Xs = (X - means) / stds
Xs1 = np.insert(Xs, 0, 1, 1)               
w = np.linalg.lstsq( Xs1.T @ Xs1, Xs1.T @ T, rcond=None)[0]   # What is the [0] doing here??
w

Another way is to construct functions for standardizing that include the calculated means and standard deviations as local variables, by using function closures.

In [ ]:
def makeStandardize(X):
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    
    def standardize(origX):
        return (origX - means) / stds
    
    def unStandardize(stdX):
        return stds * stdX + means
    
    return (standardize, unStandardize)
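A quick check of these closures on made-up data (repeating the definition so the snippet stands alone): standardizing should give zero mean and unit variance, and unStandardize should undo it exactly.

```python
import numpy as np

def makeStandardize(X):
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    def standardize(origX):
        return (origX - means) / stds
    def unStandardize(stdX):
        return stds * stdX + means
    return standardize, unStandardize

Xdemo = np.random.default_rng(4).normal(5, 2, size=(100, 3))   # arbitrary demo data
std, unstd = makeStandardize(Xdemo)
XdemoS = std(Xdemo)
print(XdemoS.mean(axis=0), XdemoS.std(axis=0))   # near 0, exactly 1
print(np.allclose(unstd(XdemoS), Xdemo))         # round trip recovers the original
```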

Let's start with X again, and tack on the column of 1's after dividing data into training and testing partitions. What happens if you tack on the constant column of 1's before standardizing?

In [ ]:
Xtrain = X[trainIndices, :]
Ttrain = T[trainIndices, :]
Xtest = X[testIndices, :]
Ttest = T[testIndices, :]

(standardize, unStandardize) = makeStandardize(Xtrain)

XtrainS = standardize(Xtrain)
XtestS = standardize(Xtest)

XtrainS.mean(0), XtrainS.std(0), XtestS.mean(0), XtestS.std(0)

Notice that the means and standard deviations for the testing set are not as close to 0 and 1 as they are for the training set. Why?

Now we can tack on the column of 1's, solve for w, and examine the weight magnitudes.

In [ ]:
XtrainS1 = np.insert(XtrainS, 0, 1, 1)
XtestS1 = np.insert(XtestS, 0, 1, 1)
w = np.linalg.lstsq( XtrainS1.T @ XtrainS1, XtrainS1.T @ Ttrain, rcond=None)[0]  # see this [0]?
for wi, name in zip(w.flat, Xnames):
    print('{:8.3f}  {:s}'.format(wi, name))

Now what do you observe about the relative magnitudes? If you had a ton of input variables, it would be easier to see if we sorted them by their magnitudes.

In [ ]:
np.abs(w)
In [ ]:
np.argsort(np.abs(w.flat))
In [ ]:
np.argsort(np.abs(w.flat))[::-1]
In [ ]:
sortedOrder = np.argsort(np.abs(w.flat))[::-1]
Xnames = np.array(Xnames)
for wi, name in zip(w.flat[sortedOrder], Xnames[sortedOrder]):
    print('{:8.3f}  {:s}'.format(wi, name))

Multiple Target Components

The beauty of matrix equations now shines. Previously $T$ was an $N \times 1$ matrix of target values, one per sample.

Just add additional columns of target values for each target component. So $T$ becomes an $N \times K$ matrix, if there are $K$ output components. Each row is one $K$-dimensional sample.

Looking again at the solution for $\wv$,

$$ \begin{align*} \wv &= (X^T X)^{-1} X^T T \end{align*} $$

we see that this changes $\wv$ from its original shape of $(D+1) \times 1$ to $(D+1) \times K$.

Let's use this to predict miles per gallon and horsepower.

First, let's assemble the data for predicting MPG and HP.

In [ ]:
X = dataNew[:, [1, 2, 4, 5, 6, 7]]
T = dataNew[:, [0, 3]] 
Tnames = [names[0], names[3]]
X.shape, Xnames, T.shape, Tnames
In [ ]:
Xtrain = X[trainIndices, :]
Ttrain = T[trainIndices, :]
Xtest = X[testIndices, :]
Ttest = T[testIndices, :]
Xtrain.shape, Ttrain.shape, Xtest.shape, Ttest.shape
In [ ]:
standardize,_ = makeStandardize(Xtrain)
XtrainS = standardize(Xtrain)
XtestS = standardize(Xtest)
XtrainS1 = np.insert(XtrainS, 0, 1, 1)
Xnames = np.array(['bias'] + names)[[0, 2, 3, 5, 6, 7, 8]]
Xnames
In [ ]:
XtrainS1.shape, Ttrain.shape
In [ ]:
w = np.linalg.lstsq( XtrainS1.T @ XtrainS1, XtrainS1.T @ Ttrain, rcond=None)[0]
w
In [ ]:
Xnames = np.array(Xnames)
for targeti in range(2):
    print('\n Weights for {} target\n'.format(Tnames[targeti]))
    thisw = w[:, targeti]
    sortedOrder = np.argsort(np.abs(thisw))[::-1]
    for wi ,name in zip(thisw[sortedOrder], Xnames[sortedOrder]):
        print('{:8.3f}  {:s}'.format(wi, name))

Now let's use our single model to predict both mpg and horsepower.

In [ ]:
XtestS1 = np.insert(XtestS, 0, 1, 1)
prediction = XtestS1 @ w
prediction.shape

Or do the insertion of a column of 1's and the multiplication by the weight matrix in one statement.

In [ ]:
prediction = np.insert(XtestS, 0, 1, 1) @ w
prediction.shape
In [ ]:
plt.figure(figsize=(10,10))
for p in range(2):
    plt.subplot(2, 1, p+1)
    plt.plot(prediction[:, p], Ttest[:, p], 'o')
    plt.xlabel("Predicted " + Tnames[p])
    plt.ylabel("Actual " + Tnames[p])
    a = max(min(prediction[:, p]), min(Ttest[:, p]))
    b = min(max(prediction[:, p]), max(Ttest[:, p]))
    plt.plot([a, b], [a, b], 'r', linewidth=3)

How well did we do in terms of RMSE?

In [ ]:
rmseTrain = np.sqrt(np.mean((XtrainS1 @ w - Ttrain)**2, axis=0))
rmseTrain
In [ ]:
rmseTest = np.sqrt(np.mean((XtestS1 @ w - Ttest)**2, axis=0))
rmseTest
In [ ]:
print('Training RMSE: MPG {:4.2f} HP {:4.2f}'.format(*rmseTrain))   #what is that * doing there??
print(' Testing RMSE: MPG {:4.2f} HP {:4.2f}'.format(*rmseTest))