$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}\;} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}\;}$

# Linear Ridge Regression¶

Remember when we were discussing the complexity of models? The simplest model was a constant. A simple way to predict rainfall is to ignore all measurements and just predict the average rainfall. A linear model of the measurements may do no better. The degree to which it does do better can be expressed as the relative sum of squared errors (RSE):

$$RSE = \frac{\sum_{i=1}^N (\tv_i - \xv_i^T \wv)^2}{\sum_{i=1}^N (\tv_i - \bar{\tv})^2}$$

If RSE is 1, then your linear model is no better than using the mean. The closer RSE is to 0, the better your linear model is.
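
The ratio can be checked on a tiny made-up example (the `targets` and prediction arrays below are toy values, not from the auto-mpg data):

```python
import numpy as np

targets = np.array([2.0, 4.0, 6.0, 8.0])
mean_prediction = np.full(4, targets.mean())    # always predict the mean
good_prediction = np.array([2.5, 3.5, 6.5, 7.5])

def rse(predictions, targets):
    # Relative sum of squared errors: model error relative to mean-model error
    return np.sum((targets - predictions)**2) / np.sum((targets - targets.mean())**2)

print(rse(mean_prediction, targets))   # 1.0: no better than the mean
print(rse(good_prediction, targets))   # well below 1: better than the mean
```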

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
!curl -O http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
!curl -O http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
100 30286  100 30286    0     0   142k      0 --:--:-- --:--:-- --:--:--  142k
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
100  1660  100  1660    0     0  15961      0 --:--:-- --:--:-- --:--:-- 15961

In [3]:
def makeMPGData(filename='auto-mpg.data'):
    def missingIsNan(s):
        return np.nan if s == b'?' else float(s)
    data = np.loadtxt(filename, usecols=range(8), converters={3: missingIsNan})
    print("Read", data.shape[0], "rows and", data.shape[1], "columns from", filename)
    goodRowsMask = np.isnan(data).sum(axis=1) == 0
    data = data[goodRowsMask, :]
    print("After removing rows containing question marks, data has", data.shape[0], "rows and", data.shape[1], "columns.")
    X = data[:, 1:]
    T = data[:, 0:1]
    Xnames = ['bias', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']
    Tname = 'mpg'
    return X, T, Xnames, Tname

In [4]:
X,T,Xnames,Tname = makeMPGData()

Read 398 rows and 8 columns from auto-mpg.data
After removing rows containing question marks, data has 392 rows and 8 columns.

In [6]:
means = X.mean(0)
stds = X.std(0)
nRows = X.shape[0]
Xs1 = np.insert((X-means)/stds, 0, 1, axis=1) # insert column of ones in new 0th column
# Xs1 = np.hstack(( np.ones((nRows,1)), (X-means)/stds ))
w = np.linalg.lstsq(Xs1.T @ Xs1, Xs1.T @ T, rcond=None)[0]
w

Out[6]:
array([[23.44591837],
[-0.84051891],
[ 2.07930256],
[-0.65163644],
[-5.49205044],
[ 0.22201407],
[ 2.76211888],
[ 1.14731588]])
In [7]:
# predict = Xs1.dot(w)
predict = Xs1 @ w
RSE = np.sum((T-predict)**2) / np.sum((T - T.mean(0))**2)
RSE

Out[7]:
0.17852192351894017

So, our linear model seems to be quite a bit better than using just the mean mpg. If our linear model were worse than using the mean, this value would be greater than 1.

Well, maybe our model would do better if it were a little closer to just the mean, or, equivalently, if it relied more on just the bias weight. This is the question that drives the derivation of "ridge regression". Let's add a term to our sum of squared error objective function that is the sum of the squared weight magnitudes, excluding the bias weight. Then we not only minimize the sum of squared errors, we also minimize the sum of the squared weight magnitudes:

$$\sum_{i=1}^N (\tv_i - \xv_i^T \wv)^2 + \lambda \sum_{j=2}^{D} w_j^2$$

where $D$ is the number of weights and $w_1$ is the bias weight. Notice that $\lambda$ in there. With $\lambda=0$ we have our usual linear regression objective function. With $\lambda>0$, we are adding in a penalty on the weight magnitudes.

How does this change our solution for the best $\wv$? You work it out. You should get

$$\wv = (X^T X + \lambda I)^{-1} X^T T$$

except instead of using

$$\lambda I = \begin{bmatrix} \lambda & 0 & \dotsc & 0\\ 0 & \lambda & \dotsc & 0\\ \vdots \\ 0 & 0 & \dotsc & \lambda \end{bmatrix}$$

we want

$$\begin{bmatrix} 0 & 0 & \dotsc & 0\\ 0 & \lambda & \dotsc & 0\\ \vdots \\ 0 & 0 & \dotsc & \lambda \end{bmatrix}$$

so we don't penalize the bias weight, the weight multiplying the constant 1 input component.
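
If you do work it out, the derivation parallels ordinary least squares: write the penalized objective in matrix form and set its gradient with respect to $\wv$ to zero. Here is a sketch using the plain $\lambda I$ penalty; zeroing its first diagonal entry, as above, simply leaves the bias weight out of the penalty.

$$E(\wv) = (T - X\wv)^T (T - X\wv) + \lambda \wv^T \wv$$

$$\frac{\partial E}{\partial \wv} = -2 X^T (T - X\wv) + 2 \lambda \wv = 0$$

$$X^T X \wv + \lambda I \wv = X^T T \quad\Rightarrow\quad \wv = (X^T X + \lambda I)^{-1} X^T T$$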

Now, what value should $\lambda$ be? We must determine it empirically, by calculating the sum of squared errors on test data.

Actually, we should not find the best value of $\lambda$ by comparing error on the test data. This will give us a too optimistic prediction of error on novel data, because the test data was used to pick the best $\lambda$. We really must hold out another partition of data from the training set for this. This third partition is often called the model validation set. So, we partition our data into disjoint training, validation, and testing subsets, and

- For each value of $\lambda$:
  - Solve for $\wv$ using the training set.
  - Use $\wv$ to predict the output for the validation set and calculate the squared error.
  - Use $\wv$ to predict the output for the testing set and calculate the squared error.
- Pick the value of $\lambda$ that produced the lowest validation set error, and report the testing set error obtained using that value of $\lambda$.

Let's do it!

In [8]:
lamb = 0.1
D = Xs1.shape[1]
lambdaDiag = np.eye(D) * lamb
lambdaDiag[0, 0] = 0  # don't penalize the bias weight
lambdaDiag

Out[8]:
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.1, 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0.1, 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.1, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.1, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0.1, 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.1, 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.1]])
In [9]:
def makeLambda(D, lamb=0):
    lambdaDiag = np.eye(D) * lamb
    lambdaDiag[0, 0] = 0  # don't penalize the bias weight
    return lambdaDiag

makeLambda(3, 0.2)

Out[9]:
array([[0. , 0. , 0. ],
[0. , 0.2, 0. ],
[0. , 0. , 0.2]])
In [11]:
w = np.linalg.lstsq(Xs1.T @ Xs1 + lambdaDiag, Xs1.T @ T, rcond=None)[0]
w

Out[11]:
array([[23.44591837],
[-0.83453746],
[ 2.05581595],
[-0.6544685 ],
[-5.474439  ],
[ 0.21836873],
[ 2.76033842],
[ 1.14610564]])
In [12]:
%precision 2

Out[12]:
'%.2f'
In [14]:
D = Xs1.shape[1]
wLambda10 = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D,10.0), Xs1.T @ T, rcond=None)[0]
wLambda0 = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D,0), Xs1.T @ T, rcond=None)[0]
np.hstack((wLambda10,wLambda0))

Out[14]:
array([[ 2.34e+01,  2.34e+01],
[-6.06e-01, -8.41e-01],
[ 7.05e-01,  2.08e+00],
[-8.74e-01, -6.52e-01],
[-4.30e+00, -5.49e+00],
[-7.00e-03,  2.22e-01],
[ 2.62e+00,  2.76e+00],
[ 1.08e+00,  1.15e+00]])
In [15]:
lambdas = [0, 0.1, 1, 10, 100]
for lamb in lambdas:
    w = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D, lamb), Xs1.T @ T, rcond=None)[0]
    plt.plot(w)

In [16]:
lambdas = [0, 0.1, 1, 10, 100]
for lamb in lambdas:
    w = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D, lamb), Xs1.T @ T, rcond=None)[0]
    plt.plot(w, '-o')
plt.xticks(range(8), Xnames, rotation=30, horizontalalignment='right')
plt.ylabel(r'$\mathbf{w}$', fontsize='large', rotation='horizontal', labelpad=20)
plt.legend(lambdas);


Since we are not modifying the bias weight, let's take it out of the plot.

In [17]:
lambdas = [0, 0.1, 1, 10, 100]
for lamb in lambdas:
    w = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D, lamb), Xs1.T @ T, rcond=None)[0]
    plt.plot(w[1:], '-o')
plt.xticks(range(7), Xnames[1:], rotation=30, horizontalalignment='right')
plt.ylabel(r'$\mathbf{w}$', fontsize='large', rotation='horizontal', labelpad=20)
plt.legend(lambdas);

In [18]:
plt.figure(figsize=(10, 10))
lambdas = [0, 10, 100, 1000, 10000]
for lamb in lambdas:
    w = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D, lamb), Xs1.T @ T, rcond=None)[0]
    plt.plot(Xs1[:30] @ w)
plt.plot(T[:30], 'ro', lw=5, alpha=0.8)
plt.ylabel('Model Output')
plt.xlabel('Sample Index')
plt.legend(lambdas, loc='best')

Out[18]:
<matplotlib.legend.Legend at 0x7feb5b689c18>

Which $\lambda$ value is best? Careful. What is the best value of $\lambda$ if we just compare error on the training data?

Now, careful again! We can't report expected error using testing data that was also used to pick the best value of $\lambda$. That error estimate is likely to be optimistic compared to what you will get on new data that was not used to train the model and also was not used to pick the value of $\lambda$.

We need a way to partition our data into training, validation, and testing subsets. Let's write a function that makes these partitions randomly, given the data matrices and the fractions to be used in the three partitions.

In [20]:
def partition(X, T, trainFraction=0.8, validateFraction=0.1, testFraction=0.1):
    '''Usage: Xtrain,Ttrain,Xval,Tval,Xtest,Ttest = partition(X,T,0.6,0.2,0.2)'''
    # Compare with a tolerance, because sums like 0.8 + 0.1 + 0.1 are not exactly 1 in floating point
    if abs(trainFraction + validateFraction + testFraction - 1) > 1e-10:
        raise ValueError("Train, validate and test fractions must sum to 1. Given values sum to " +
                         str(trainFraction + validateFraction + testFraction))
    n = X.shape[0]
    nTrain = round(trainFraction * n)
    nValidate = round(validateFraction * n)
    nTest = round(testFraction * n)
    if nTrain + nValidate + nTest != n:
        nTest = n - nTrain - nValidate
    # Random order of data matrix row indices
    rowIndices = np.arange(n)
    np.random.shuffle(rowIndices)
    # Build X and T matrices by selecting corresponding rows for each partition
    Xtrain = X[rowIndices[:nTrain], :]
    Ttrain = T[rowIndices[:nTrain], :]
    Xvalidate = X[rowIndices[nTrain:nTrain + nValidate], :]
    Tvalidate = T[rowIndices[nTrain:nTrain + nValidate], :]
    Xtest = X[rowIndices[nTrain + nValidate:nTrain + nValidate + nTest], :]
    Ttest = T[rowIndices[nTrain + nValidate:nTrain + nValidate + nTest], :]
    return Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest

In [21]:
X = np.arange(20).reshape((10, 2))
X

Out[21]:
array([[ 0,  1],
[ 2,  3],
[ 4,  5],
[ 6,  7],
[ 8,  9],
[10, 11],
[12, 13],
[14, 15],
[16, 17],
[18, 19]])
In [22]:
T = np.arange(10).reshape((-1, 1))
T

Out[22]:
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
In [23]:
X = np.arange(20).reshape((10, 2)) + np.random.uniform(0, 1, (10, 2))
T = np.arange(10).reshape((-1, 1))
Xtrain, Ttrain, Xval, Tval, Xtest, Ttest = partition(X, T, 0.6, 0.2, 0.2)

In [24]:
print("Xtrain:")
print(Xtrain)
print(" Ttrain:")
print(Ttrain)
print("\n Xval:")
print(Xval)
print(" Tval:")
print(Tval)
print("\n Xtest:")
print(Xtest)
print(" Ttest:")
print(Ttest)

Xtrain:
[[18.01 19.54]
[ 6.07  7.55]
[ 2.07  3.24]
[ 4.16  5.64]
[ 8.18  9.  ]
[14.25 15.88]]
Ttrain:
[[9]
[3]
[1]
[2]
[4]
[7]]

Xval:
[[12.85 14.  ]
[ 0.98  1.19]]
Tval:
[[6]
[0]]

Xtest:
[[16.63 17.31]
[10.39 11.14]]
Ttest:
[[8]
[5]]


Now we can train our model using several values of $\lambda$ on Xtrain,Ttrain and calculate the model error on Xval,Tval. Then we pick the best value of $\lambda$ based on the error on Xval,Tval. Finally, we calculate the error of the model using the best $\lambda$ on Xtest,Ttest as our estimate of error on new data.

However, these errors will be affected by the random partitioning of the data. Repeating with new partitions may result in a different $\lambda$ being best. We should repeat the above procedure many times and average over the results. How many repetitions do we need?

Another approach, commonly followed in machine learning, is to first partition the data into $k$ subsets, called "folds". Pick one fold to be the test partition, another fold to be the validate partition, and collect the remaining folds to be the train partition. We can do this $k\,(k-1)$ ways. (Why?) So, with $k=5$ we get 20 repetitions, performed in a way that best distributes the samples among the partitions.
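
To convince yourself of the $k\,(k-1)$ count, you can enumerate the ordered (test fold, validate fold) pairs directly; here is a small sketch using the standard library:

```python
from itertools import permutations

k = 5
# Every ordered pair of distinct folds: first as test fold, second as validate fold.
pairs = list(permutations(range(k), 2))
print(len(pairs))   # k * (k - 1) = 20
```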

First, a little note on the yield statement in Python. The yield statement is like return, except that execution pauses at that point in the function after handing back the values. When the function's generator is asked for its next value, execution continues from that paused point.

In [25]:
def count(n):
    for a in range(n):
        yield a

In [26]:
count(4)

Out[26]:
<generator object count at 0x7feb5b361938>
In [27]:
list(count(4))

Out[27]:
[0, 1, 2, 3]
In [28]:
for i in count(5):
    print(i)

0
1
2
3
4

In [29]:
def partitionKFolds(X, T, nFolds, shuffle=False, nPartitions=3):
    '''Usage: for Xtrain,Ttrain,Xval,Tval,Xtest,Ttest,testFold in partitionKFolds(X,T,5,shuffle=True,nPartitions=3):
              for Xtrain,Ttrain,Xtest,Ttest,testFold in partitionKFolds(X,T,5,shuffle=True,nPartitions=2):'''
    # Randomly arrange row indices
    rowIndices = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(rowIndices)
    # Calculate number of samples in each of the nFolds folds
    nSamples = X.shape[0]
    nEach = int(nSamples / nFolds)
    if nEach == 0:
        raise ValueError("partitionKFolds: Number of samples in each fold is 0.")
    # Calculate the starting and stopping row index for each fold.
    # Store in startsStops as list of (start,stop) pairs
    starts = np.arange(0, nEach * nFolds, nEach)
    stops = starts + nEach
    stops[-1] = nSamples
    startsStops = list(zip(starts, stops))
    # Repeat with testFold taking each single fold, one at a time
    for testFold in range(nFolds):
        if nPartitions == 3:
            # Repeat with validateFold taking each single fold, except for the testFold
            for validateFold in range(nFolds):
                if testFold == validateFold:
                    continue
                # trainFolds are all remaining folds, after selecting test and validate folds
                trainFolds = np.setdiff1d(range(nFolds), [testFold, validateFold])
                # Construct Xtrain and Ttrain by collecting rows for all trainFolds
                rows = []
                for tf in trainFolds:
                    a, b = startsStops[tf]
                    rows += rowIndices[a:b].tolist()
                Xtrain = X[rows, :]
                Ttrain = T[rows, :]
                # Construct Xvalidate and Tvalidate
                a, b = startsStops[validateFold]
                rows = rowIndices[a:b]
                Xvalidate = X[rows, :]
                Tvalidate = T[rows, :]
                # Construct Xtest and Ttest
                a, b = startsStops[testFold]
                rows = rowIndices[a:b]
                Xtest = X[rows, :]
                Ttest = T[rows, :]
                # Return partition matrices, then suspend until called again.
                yield Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest, testFold
        else:
            # trainFolds are all remaining folds, after selecting the test fold
            trainFolds = np.setdiff1d(range(nFolds), [testFold])
            # Construct Xtrain and Ttrain by collecting rows for all trainFolds
            rows = []
            for tf in trainFolds:
                a, b = startsStops[tf]
                rows += rowIndices[a:b].tolist()
            Xtrain = X[rows, :]
            Ttrain = T[rows, :]
            # Construct Xtest and Ttest
            a, b = startsStops[testFold]
            rows = rowIndices[a:b]
            Xtest = X[rows, :]
            Ttest = T[rows, :]
            # Return partition matrices, then suspend until called again.
            yield Xtrain, Ttrain, Xtest, Ttest, testFold

In [30]:
X = np.arange(20).reshape((10, 2))
T = np.arange(10).reshape((-1, 1)) + np.random.uniform(-1, 1, (10, 1))
k = 0
for Xtrain, Ttrain, Xval, Tval, Xtest, Ttest, testFold in partitionKFolds(X, T, 5):
    k += 1
    print("Fold", k)
    print(" Xtrain:")
    print(Xtrain)
    print(" Ttrain:")
    print(Ttrain)
    print("\n Xval:")
    print(Xval)
    print(" Tval:")
    print(Tval)
    print("\n Xtest:")
    print(Xtest)
    print(" Ttest:")
    print(Ttest)

Fold 1
Xtrain:
[[ 8  9]
[10 11]
[12 13]
[14 15]
[16 17]
[18 19]]
Ttrain:
[[4.09]
[5.88]
[6.55]
[7.46]
[8.1 ]
[8.95]]

Xval:
[[4 5]
[6 7]]
Tval:
[[1.44]
[3.19]]

Xtest:
[[0 1]
[2 3]]
Ttest:
[[-0.05]
[ 1.53]]
Fold 2
Xtrain:
[[ 4  5]
[ 6  7]
[12 13]
[14 15]
[16 17]
[18 19]]
Ttrain:
[[1.44]
[3.19]
[6.55]
[7.46]
[8.1 ]
[8.95]]

Xval:
[[ 8  9]
[10 11]]
Tval:
[[4.09]
[5.88]]

Xtest:
[[0 1]
[2 3]]
Ttest:
[[-0.05]
[ 1.53]]
Fold 3
Xtrain:
[[ 4  5]
[ 6  7]
[ 8  9]
[10 11]
[16 17]
[18 19]]
Ttrain:
[[1.44]
[3.19]
[4.09]
[5.88]
[8.1 ]
[8.95]]

Xval:
[[12 13]
[14 15]]
Tval:
[[6.55]
[7.46]]

Xtest:
[[0 1]
[2 3]]
Ttest:
[[-0.05]
[ 1.53]]
Fold 4
Xtrain:
[[ 4  5]
[ 6  7]
[ 8  9]
[10 11]
[12 13]
[14 15]]
Ttrain:
[[1.44]
[3.19]
[4.09]
[5.88]
[6.55]
[7.46]]

Xval:
[[16 17]
[18 19]]
Tval:
[[8.1 ]
[8.95]]

Xtest:
[[0 1]
[2 3]]
Ttest:
[[-0.05]
[ 1.53]]
Fold 5
Xtrain:
[[ 8  9]
[10 11]
[12 13]
[14 15]
[16 17]
[18 19]]
Ttrain:
[[4.09]
[5.88]
[6.55]
[7.46]
[8.1 ]
[8.95]]

Xval:
[[0 1]
[2 3]]
Tval:
[[-0.05]
[ 1.53]]

Xtest:
[[4 5]
[6 7]]
Ttest:
[[1.44]
[3.19]]
Fold 6
Xtrain:
[[ 0  1]
[ 2  3]
[12 13]
[14 15]
[16 17]
[18 19]]
Ttrain:
[[-0.05]
[ 1.53]
[ 6.55]
[ 7.46]
[ 8.1 ]
[ 8.95]]

Xval:
[[ 8  9]
[10 11]]
Tval:
[[4.09]
[5.88]]

Xtest:
[[4 5]
[6 7]]
Ttest:
[[1.44]
[3.19]]
Fold 7
Xtrain:
[[ 0  1]
[ 2  3]
[ 8  9]
[10 11]
[16 17]
[18 19]]
Ttrain:
[[-0.05]
[ 1.53]
[ 4.09]
[ 5.88]
[ 8.1 ]
[ 8.95]]

Xval:
[[12 13]
[14 15]]
Tval:
[[6.55]
[7.46]]

Xtest:
[[4 5]
[6 7]]
Ttest:
[[1.44]
[3.19]]
Fold 8
Xtrain:
[[ 0  1]
[ 2  3]
[ 8  9]
[10 11]
[12 13]
[14 15]]
Ttrain:
[[-0.05]
[ 1.53]
[ 4.09]
[ 5.88]
[ 6.55]
[ 7.46]]

Xval:
[[16 17]
[18 19]]
Tval:
[[8.1 ]
[8.95]]

Xtest:
[[4 5]
[6 7]]
Ttest:
[[1.44]
[3.19]]
Fold 9
Xtrain:
[[ 4  5]
[ 6  7]
[12 13]
[14 15]
[16 17]
[18 19]]
Ttrain:
[[1.44]
[3.19]
[6.55]
[7.46]
[8.1 ]
[8.95]]

Xval:
[[0 1]
[2 3]]
Tval:
[[-0.05]
[ 1.53]]

Xtest:
[[ 8  9]
[10 11]]
Ttest:
[[4.09]
[5.88]]
Fold 10
Xtrain:
[[ 0  1]
[ 2  3]
[12 13]
[14 15]
[16 17]
[18 19]]
Ttrain:
[[-0.05]
[ 1.53]
[ 6.55]
[ 7.46]
[ 8.1 ]
[ 8.95]]

Xval:
[[4 5]
[6 7]]
Tval:
[[1.44]
[3.19]]

Xtest:
[[ 8  9]
[10 11]]
Ttest:
[[4.09]
[5.88]]
Fold 11
Xtrain:
[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[16 17]
[18 19]]
Ttrain:
[[-0.05]
[ 1.53]
[ 1.44]
[ 3.19]
[ 8.1 ]
[ 8.95]]

Xval:
[[12 13]
[14 15]]
Tval:
[[6.55]
[7.46]]

Xtest:
[[ 8  9]
[10 11]]
Ttest:
[[4.09]
[5.88]]
Fold 12
Xtrain:
[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[12 13]
[14 15]]
Ttrain:
[[-0.05]
[ 1.53]
[ 1.44]
[ 3.19]
[ 6.55]
[ 7.46]]

Xval:
[[16 17]
[18 19]]
Tval:
[[8.1 ]
[8.95]]

Xtest:
[[ 8  9]
[10 11]]
Ttest:
[[4.09]
[5.88]]
Fold 13
Xtrain:
[[ 4  5]
[ 6  7]
[ 8  9]
[10 11]
[16 17]
[18 19]]
Ttrain:
[[1.44]
[3.19]
[4.09]
[5.88]
[8.1 ]
[8.95]]

Xval:
[[0 1]
[2 3]]
Tval:
[[-0.05]
[ 1.53]]

Xtest:
[[12 13]
[14 15]]
Ttest:
[[6.55]
[7.46]]
Fold 14
Xtrain:
[[ 0  1]
[ 2  3]
[ 8  9]
[10 11]
[16 17]
[18 19]]
Ttrain:
[[-0.05]
[ 1.53]
[ 4.09]
[ 5.88]
[ 8.1 ]
[ 8.95]]

Xval:
[[4 5]
[6 7]]
Tval:
[[1.44]
[3.19]]

Xtest:
[[12 13]
[14 15]]
Ttest:
[[6.55]
[7.46]]
Fold 15
Xtrain:
[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[16 17]
[18 19]]
Ttrain:
[[-0.05]
[ 1.53]
[ 1.44]
[ 3.19]
[ 8.1 ]
[ 8.95]]

Xval:
[[ 8  9]
[10 11]]
Tval:
[[4.09]
[5.88]]

Xtest:
[[12 13]
[14 15]]
Ttest:
[[6.55]
[7.46]]
Fold 16
Xtrain:
[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[ 8  9]
[10 11]]
Ttrain:
[[-0.05]
[ 1.53]
[ 1.44]
[ 3.19]
[ 4.09]
[ 5.88]]

Xval:
[[16 17]
[18 19]]
Tval:
[[8.1 ]
[8.95]]

Xtest:
[[12 13]
[14 15]]
Ttest:
[[6.55]
[7.46]]
Fold 17
Xtrain:
[[ 4  5]
[ 6  7]
[ 8  9]
[10 11]
[12 13]
[14 15]]
Ttrain:
[[1.44]
[3.19]
[4.09]
[5.88]
[6.55]
[7.46]]

Xval:
[[0 1]
[2 3]]
Tval:
[[-0.05]
[ 1.53]]

Xtest:
[[16 17]
[18 19]]
Ttest:
[[8.1 ]
[8.95]]
Fold 18
Xtrain:
[[ 0  1]
[ 2  3]
[ 8  9]
[10 11]
[12 13]
[14 15]]
Ttrain:
[[-0.05]
[ 1.53]
[ 4.09]
[ 5.88]
[ 6.55]
[ 7.46]]

Xval:
[[4 5]
[6 7]]
Tval:
[[1.44]
[3.19]]

Xtest:
[[16 17]
[18 19]]
Ttest:
[[8.1 ]
[8.95]]
Fold 19
Xtrain:
[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[12 13]
[14 15]]
Ttrain:
[[-0.05]
[ 1.53]
[ 1.44]
[ 3.19]
[ 6.55]
[ 7.46]]

Xval:
[[ 8  9]
[10 11]]
Tval:
[[4.09]
[5.88]]

Xtest:
[[16 17]
[18 19]]
Ttest:
[[8.1 ]
[8.95]]
Fold 20
Xtrain:
[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[ 8  9]
[10 11]]
Ttrain:
[[-0.05]
[ 1.53]
[ 1.44]
[ 3.19]
[ 4.09]
[ 5.88]]

Xval:
[[12 13]
[14 15]]
Tval:
[[6.55]
[7.46]]

Xtest:
[[16 17]
[18 19]]
Ttest:
[[8.1 ]
[8.95]]


Typical use of these partitions is shown below. It is most handy to just collect all results in a matrix and calculate averages afterwards, rather than accumulating each result and dividing by the number of repetitions at the end.

In [31]:
def train(X, T, lamb):
    means = X.mean(0)
    stds = X.std(0)
    n, d = X.shape
    Xs1 = np.insert((X - means)/stds, 0, 1, axis=1)
    lambDiag = np.eye(d+1) * lamb
    lambDiag[0, 0] = 0
    w = np.linalg.lstsq(Xs1.T @ Xs1 + lambDiag, Xs1.T @ T, rcond=None)[0]
    return {'w': w, 'means': means, 'stds': stds}

def use(model, X):
    Xs1 = np.insert((X - model['means'])/model['stds'], 0, 1, axis=1)
    return Xs1 @ model['w']

def rmse(A, B):
    return np.sqrt(np.mean((A - B)**2))

def evaluate(model, X, T):
    return rmse(use(model, X), T)

In [32]:
lambdas = np.linspace(0, 5, 20)
results = []
for Xtrain, Ttrain, Xval, Tval, Xtest, Ttest, _ in partitionKFolds(X, T, 5):
    for lamb in lambdas:
        model = train(Xtrain, Ttrain, lamb)
        results.append([lamb,
                        evaluate(model, Xtrain, Ttrain),
                        evaluate(model, Xval, Tval),
                        evaluate(model, Xtest, Ttest)])
results = np.array(results)
avgresults = []
for lam in lambdas:
    print(lam)
    print(results[results[:, 0] == lam, 1:])
    avgresults.append([lam] + np.mean(results[results[:, 0] == lam, 1:], axis=0).tolist())
avgresults = np.array(avgresults)
print(avgresults)

0.0
[[0.29 0.97 0.72]
[0.33 0.58 0.51]
[0.42 0.36 0.39]
[0.3  1.06 0.79]
[0.29 0.72 0.97]
[0.26 0.46 0.6 ]
[0.34 0.29 0.6 ]
[0.27 0.68 0.62]
[0.33 0.51 0.58]
[0.26 0.6  0.46]
[0.33 0.49 0.61]
[0.34 0.58 0.41]
[0.42 0.39 0.36]
[0.34 0.6  0.29]
[0.33 0.61 0.49]
[0.4  0.87 0.18]
[0.3  0.79 1.06]
[0.27 0.62 0.68]
[0.34 0.41 0.58]
[0.4  0.18 0.87]]
0.2631578947368421
[[0.29 1.05 0.83]
[0.34 0.56 0.42]
[0.42 0.39 0.33]
[0.3  0.95 0.69]
[0.29 0.83 1.05]
[0.27 0.46 0.65]
[0.35 0.33 0.64]
[0.27 0.57 0.65]
[0.34 0.42 0.56]
[0.27 0.65 0.46]
[0.33 0.55 0.63]
[0.35 0.46 0.44]
[0.42 0.33 0.39]
[0.35 0.64 0.33]
[0.33 0.63 0.55]
[0.41 0.72 0.1 ]
[0.3  0.69 0.95]
[0.27 0.65 0.57]
[0.35 0.44 0.46]
[0.41 0.1  0.72]]
0.5263157894736842
[[0.29 1.12 0.93]
[0.35 0.55 0.35]
[0.43 0.42 0.3 ]
[0.31 0.85 0.6 ]
[0.29 0.93 1.12]
[0.3  0.47 0.7 ]
[0.37 0.37 0.68]
[0.29 0.47 0.68]
[0.35 0.35 0.55]
[0.3  0.7  0.47]
[0.36 0.6  0.65]
[0.36 0.34 0.46]
[0.43 0.3  0.42]
[0.37 0.68 0.37]
[0.36 0.65 0.6 ]
[0.41 0.59 0.09]
[0.31 0.6  0.85]
[0.29 0.68 0.47]
[0.36 0.46 0.34]
[0.41 0.09 0.59]]
0.7894736842105263
[[0.3  1.19 1.04]
[0.37 0.54 0.3 ]
[0.45 0.44 0.31]
[0.32 0.76 0.52]
[0.3  1.04 1.19]
[0.34 0.47 0.75]
[0.39 0.41 0.72]
[0.32 0.37 0.71]
[0.37 0.3  0.54]
[0.34 0.75 0.47]
[0.39 0.65 0.66]
[0.38 0.23 0.49]
[0.45 0.31 0.44]
[0.39 0.72 0.41]
[0.39 0.66 0.65]
[0.42 0.45 0.15]
[0.32 0.52 0.76]
[0.32 0.71 0.37]
[0.38 0.49 0.23]
[0.42 0.15 0.45]]
1.0526315789473684
[[0.31 1.26 1.14]
[0.4  0.53 0.3 ]
[0.47 0.47 0.33]
[0.34 0.67 0.45]
[0.31 1.14 1.26]
[0.38 0.47 0.79]
[0.43 0.45 0.76]
[0.34 0.28 0.74]
[0.4  0.3  0.53]
[0.38 0.79 0.47]
[0.43 0.7  0.68]
[0.41 0.12 0.51]
[0.47 0.33 0.47]
[0.43 0.76 0.45]
[0.43 0.68 0.7 ]
[0.43 0.33 0.23]
[0.34 0.45 0.67]
[0.34 0.74 0.28]
[0.41 0.51 0.12]
[0.43 0.23 0.33]]
1.3157894736842104
[[0.32 1.32 1.24]
[0.43 0.52 0.33]
[0.49 0.49 0.38]
[0.36 0.58 0.38]
[0.32 1.24 1.32]
[0.43 0.48 0.84]
[0.47 0.48 0.79]
[0.38 0.19 0.76]
[0.43 0.33 0.52]
[0.43 0.84 0.48]
[0.47 0.75 0.69]
[0.44 0.06 0.53]
[0.49 0.38 0.49]
[0.47 0.79 0.48]
[0.47 0.69 0.75]
[0.45 0.21 0.3 ]
[0.36 0.38 0.58]
[0.38 0.76 0.19]
[0.44 0.53 0.06]
[0.45 0.3  0.21]]
1.5789473684210527
[[0.34 1.39 1.33]
[0.46 0.52 0.39]
[0.52 0.52 0.44]
[0.38 0.49 0.33]
[0.34 1.33 1.39]
[0.47 0.48 0.88]
[0.51 0.52 0.82]
[0.41 0.1  0.79]
[0.46 0.39 0.52]
[0.47 0.88 0.48]
[0.51 0.8  0.71]
[0.47 0.11 0.56]
[0.52 0.44 0.52]
[0.51 0.82 0.52]
[0.51 0.71 0.8 ]
[0.46 0.1  0.38]
[0.38 0.33 0.49]
[0.41 0.79 0.1 ]
[0.47 0.56 0.11]
[0.46 0.38 0.1 ]]
1.8421052631578947
[[0.35 1.45 1.42]
[0.49 0.51 0.46]
[0.54 0.54 0.5 ]
[0.4  0.41 0.3 ]
[0.35 1.42 1.45]
[0.52 0.49 0.93]
[0.55 0.55 0.86]
[0.45 0.03 0.81]
[0.49 0.46 0.51]
[0.52 0.93 0.49]
[0.56 0.84 0.72]
[0.5  0.2  0.58]
[0.54 0.5  0.54]
[0.55 0.86 0.55]
[0.56 0.72 0.84]
[0.48 0.07 0.45]
[0.4  0.3  0.41]
[0.45 0.81 0.03]
[0.5  0.58 0.2 ]
[0.48 0.45 0.07]]
2.1052631578947367
[[0.37 1.5  1.51]
[0.52 0.51 0.53]
[0.57 0.56 0.57]
[0.43 0.34 0.28]
[0.37 1.51 1.5 ]
[0.57 0.49 0.97]
[0.59 0.58 0.89]
[0.48 0.07 0.84]
[0.52 0.53 0.51]
[0.57 0.97 0.49]
[0.61 0.89 0.73]
[0.53 0.29 0.6 ]
[0.57 0.57 0.56]
[0.59 0.89 0.58]
[0.61 0.73 0.89]
[0.49 0.15 0.53]
[0.43 0.28 0.34]
[0.48 0.84 0.07]
[0.53 0.6  0.29]
[0.49 0.53 0.15]]
2.3684210526315788
[[0.38 1.56 1.59]
[0.55 0.5  0.6 ]
[0.6  0.58 0.63]
[0.45 0.26 0.29]
[0.38 1.59 1.56]
[0.62 0.5  1.01]
[0.63 0.62 0.92]
[0.52 0.14 0.86]
[0.55 0.6  0.5 ]
[0.62 1.01 0.5 ]
[0.65 0.93 0.75]
[0.57 0.38 0.62]
[0.6  0.63 0.58]
[0.63 0.92 0.62]
[0.65 0.75 0.93]
[0.51 0.25 0.59]
[0.45 0.29 0.26]
[0.52 0.86 0.14]
[0.57 0.62 0.38]
[0.51 0.59 0.25]]
2.631578947368421
[[0.4  1.62 1.67]
[0.59 0.5  0.68]
[0.63 0.6  0.7 ]
[0.47 0.19 0.31]
[0.4  1.67 1.62]
[0.67 0.5  1.05]
[0.68 0.65 0.95]
[0.55 0.22 0.88]
[0.59 0.68 0.5 ]
[0.67 1.05 0.5 ]
[0.7  0.97 0.76]
[0.6  0.46 0.64]
[0.63 0.7  0.6 ]
[0.68 0.95 0.65]
[0.7  0.76 0.97]
[0.53 0.35 0.66]
[0.47 0.31 0.19]
[0.55 0.88 0.22]
[0.6  0.64 0.46]
[0.53 0.66 0.35]]
2.894736842105263
[[0.42 1.67 1.75]
[0.62 0.5  0.75]
[0.66 0.62 0.76]
[0.5  0.12 0.35]
[0.42 1.75 1.67]
[0.71 0.51 1.08]
[0.72 0.67 0.97]
[0.59 0.29 0.91]
[0.62 0.75 0.5 ]
[0.71 1.08 0.51]
[0.74 1.01 0.77]
[0.63 0.55 0.66]
[0.66 0.76 0.62]
[0.72 0.97 0.67]
[0.74 0.77 1.01]
[0.55 0.45 0.73]
[0.5  0.35 0.12]
[0.59 0.91 0.29]
[0.63 0.66 0.55]
[0.55 0.73 0.45]]
3.1578947368421053
[[0.43 1.72 1.83]
[0.65 0.5  0.83]
[0.69 0.64 0.82]
[0.52 0.06 0.39]
[0.43 1.83 1.72]
[0.76 0.51 1.12]
[0.76 0.7  1.  ]
[0.62 0.36 0.93]
[0.65 0.83 0.5 ]
[0.76 1.12 0.51]
[0.78 1.05 0.78]
[0.67 0.63 0.68]
[0.69 0.82 0.64]
[0.76 1.   0.7 ]
[0.78 0.78 1.05]
[0.57 0.54 0.79]
[0.52 0.39 0.06]
[0.62 0.93 0.36]
[0.67 0.68 0.63]
[0.57 0.79 0.54]]
3.4210526315789473
[[0.45 1.77 1.9 ]
[0.68 0.5  0.9 ]
[0.72 0.66 0.88]
[0.54 0.04 0.44]
[0.45 1.9  1.77]
[0.8  0.52 1.16]
[0.8  0.73 1.03]
[0.66 0.43 0.95]
[0.68 0.9  0.5 ]
[0.8  1.16 0.52]
[0.83 1.08 0.79]
[0.7  0.7  0.7 ]
[0.72 0.88 0.66]
[0.8  1.03 0.73]
[0.83 0.79 1.08]
[0.59 0.63 0.85]
[0.54 0.44 0.04]
[0.66 0.95 0.43]
[0.7  0.7  0.7 ]
[0.59 0.85 0.63]]
3.6842105263157894
[[0.46 1.81 1.97]
[0.71 0.51 0.97]
[0.74 0.68 0.94]
[0.56 0.09 0.49]
[0.46 1.97 1.81]
[0.84 0.52 1.19]
[0.83 0.76 1.05]
[0.69 0.49 0.97]
[0.71 0.97 0.51]
[0.84 1.19 0.52]
[0.87 1.12 0.81]
[0.73 0.78 0.72]
[0.74 0.94 0.68]
[0.83 1.05 0.76]
[0.87 0.81 1.12]
[0.6  0.72 0.91]
[0.56 0.49 0.09]
[0.69 0.97 0.49]
[0.73 0.72 0.78]
[0.6  0.91 0.72]]
3.9473684210526314
[[0.48 1.86 2.04]
[0.74 0.51 1.04]
[0.77 0.7  1.  ]
[0.59 0.15 0.54]
[0.48 2.04 1.86]
[0.88 0.53 1.22]
[0.87 0.78 1.08]
[0.72 0.55 0.98]
[0.74 1.04 0.51]
[0.88 1.22 0.53]
[0.91 1.15 0.82]
[0.76 0.85 0.74]
[0.77 1.   0.7 ]
[0.87 1.08 0.78]
[0.91 0.82 1.15]
[0.62 0.8  0.96]
[0.59 0.54 0.15]
[0.72 0.98 0.55]
[0.76 0.74 0.85]
[0.62 0.96 0.8 ]]
4.2105263157894735
[[0.49 1.9  2.1 ]
[0.77 0.51 1.1 ]
[0.8  0.72 1.06]
[0.61 0.21 0.59]
[0.49 2.1  1.9 ]
[0.92 0.53 1.26]
[0.91 0.81 1.1 ]
[0.75 0.61 1.  ]
[0.77 1.1  0.51]
[0.92 1.26 0.53]
[0.95 1.18 0.83]
[0.79 0.92 0.76]
[0.8  1.06 0.72]
[0.91 1.1  0.81]
[0.95 0.83 1.18]
[0.64 0.88 1.02]
[0.61 0.59 0.21]
[0.75 1.   0.61]
[0.79 0.76 0.92]
[0.64 1.02 0.88]]
4.473684210526316
[[0.51 1.95 2.17]
[0.8  0.52 1.17]
[0.83 0.73 1.11]
[0.63 0.26 0.64]
[0.51 2.17 1.95]
[0.96 0.54 1.29]
[0.95 0.83 1.13]
[0.78 0.67 1.02]
[0.8  1.17 0.52]
[0.96 1.29 0.54]
[0.98 1.21 0.84]
[0.82 0.99 0.77]
[0.83 1.11 0.73]
[0.95 1.13 0.83]
[0.98 0.84 1.21]
[0.66 0.96 1.07]
[0.63 0.64 0.26]
[0.78 1.02 0.67]
[0.82 0.77 0.99]
[0.66 1.07 0.96]]
4.7368421052631575
[[0.52 1.99 2.23]
[0.83 0.52 1.23]
[0.85 0.75 1.17]
[0.65 0.32 0.69]
[0.52 2.23 1.99]
[1.   0.54 1.32]
[0.98 0.85 1.15]
[0.81 0.73 1.04]
[0.83 1.23 0.52]
[1.   1.32 0.54]
[1.02 1.25 0.85]
[0.85 1.06 0.79]
[0.85 1.17 0.75]
[0.98 1.15 0.85]
[1.02 0.85 1.25]
[0.67 1.04 1.12]
[0.65 0.69 0.32]
[0.81 1.04 0.73]
[0.85 0.79 1.06]
[0.67 1.12 1.04]]
5.0
[[0.54 2.03 2.29]
[0.86 0.53 1.29]
[0.88 0.77 1.22]
[0.67 0.37 0.74]
[0.54 2.29 2.03]
[1.04 0.55 1.35]
[1.01 0.87 1.17]
[0.84 0.79 1.05]
[0.86 1.29 0.53]
[1.04 1.35 0.55]
[1.06 1.28 0.85]
[0.88 1.12 0.81]
[0.88 1.22 0.77]
[1.01 1.17 0.87]
[1.06 0.85 1.28]
[0.69 1.11 1.17]
[0.67 0.74 0.37]
[0.84 1.05 0.79]
[0.88 0.81 1.12]
[0.69 1.17 1.11]]
[[0.   0.33 0.59 0.59]
[0.26 0.33 0.57 0.57]
[0.53 0.35 0.56 0.56]
[0.79 0.37 0.56 0.56]
[1.05 0.39 0.56 0.56]
[1.32 0.42 0.57 0.57]
[1.58 0.45 0.58 0.58]
[1.84 0.48 0.61 0.61]
[2.11 0.52 0.64 0.64]
[2.37 0.55 0.68 0.68]
[2.63 0.58 0.72 0.72]
[2.89 0.61 0.76 0.76]
[3.16 0.64 0.79 0.79]
[3.42 0.68 0.83 0.83]
[3.68 0.71 0.87 0.87]
[3.95 0.73 0.92 0.92]
[4.21 0.76 0.96 0.96]
[4.47 0.79 0.99 0.99]
[4.74 0.82 1.03 1.03]
[5.   0.85 1.07 1.07]]

In [33]:
plt.plot(avgresults[:, 0], avgresults[:, [1, 3]], 'o-')
plt.xlabel(r'$\lambda$')
plt.ylabel('RMSE')
plt.legend(('Train', 'Test'), loc='best');


Here is yet another Python implementation that includes many of the common steps used when training and evaluating models. We will use this version for many of our experiments.

In [34]:
def trainValidateTestKFolds(trainf, evaluatef, X, T, parameterSets, nFolds=5,
                            shuffle=False, verbose=False):
    # Randomly arrange row indices
    rowIndices = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(rowIndices)
    isNewBetterThanOld = lambda new, old: new < old
    # Calculate number of samples in each of the nFolds folds
    nSamples = X.shape[0]
    nEach = int(nSamples / nFolds)
    if nEach == 0:
        raise ValueError("trainValidateTestKFolds: Number of samples in each fold is 0.")
    # Calculate the starting and stopping row index for each fold.
    # Store in startsStops as list of (start,stop) pairs
    starts = np.arange(0, nEach * nFolds, nEach)
    stops = starts + nEach
    stops[-1] = nSamples
    startsStops = list(zip(starts, stops))

    # Repeat with testFold taking each single fold, one at a time
    results = []
    for testFold in range(nFolds):
        # Leaving the testFold out, for each validate fold, train on the
        # remaining folds and evaluate on the validate fold.
        # Construct Xtest and Ttest
        a, b = startsStops[testFold]
        rows = rowIndices[a:b]
        Xtest = X[rows, :]
        Ttest = T[rows, :]

        bestParms = None
        for parms in parameterSets:
            validateEvaluationSum = 0
            for validateFold in range(nFolds):
                if testFold == validateFold:
                    continue
                # Construct Xtrain and Ttrain
                trainFolds = np.setdiff1d(range(nFolds), [testFold, validateFold])
                rows = []
                for tf in trainFolds:
                    a, b = startsStops[tf]
                    rows += rowIndices[a:b].tolist()
                Xtrain = X[rows, :]
                Ttrain = T[rows, :]
                # Construct Xvalidate and Tvalidate
                a, b = startsStops[validateFold]
                rows = rowIndices[a:b]
                Xvalidate = X[rows, :]
                Tvalidate = T[rows, :]

                model = trainf(Xtrain, Ttrain, parms)
                validateEvaluationSum += evaluatef(model, Xvalidate, Tvalidate)
            validateEvaluation = validateEvaluationSum / (nFolds - 1)
            if verbose:
                if hasattr(model, 'bestIteration') and model.bestIteration is not None:
                    print('{} Val {:.3f} Best Iter {:d}'.format(parms, validateEvaluation, model.bestIteration))
                else:
                    print('{} Val {:.3f}'.format(parms, validateEvaluation))

            if bestParms is None or isNewBetterThanOld(validateEvaluation, bestValidationEvaluation):
                bestParms = parms
                bestValidationEvaluation = validateEvaluation
                if verbose:
                    print('New best')

        # Retrain with the best parameter value, on training plus validation data
        newXtrain = np.vstack((Xtrain, Xvalidate))
        newTtrain = np.vstack((Ttrain, Tvalidate))

        model = trainf(newXtrain, newTtrain, bestParms)
        trainEvaluation = evaluatef(model, newXtrain, newTtrain)
        testEvaluation = evaluatef(model, Xtest, Ttest)

        resultThisTestFold = [nFolds, testFold + 1, bestParms, trainEvaluation,
                              bestValidationEvaluation, testEvaluation]
        results.append(resultThisTestFold)
        if verbose:
            print(resultThisTestFold)
        print('{}/{}'.format(testFold + 1, nFolds), end=' ', flush=True)
    return np.array(results)

In [35]:
lambdas = np.linspace(0, 5, 20)
results = trainValidateTestKFolds(train, evaluate, X, T, lambdas, nFolds=5,
                                  shuffle=True, verbose=False)

1/5 2/5 3/5 4/5 5/5
In [36]:
results

Out[36]:
array([[5.  , 1.  , 0.  , 0.39, 0.49, 0.39],
[5.  , 2.  , 0.53, 0.41, 0.46, 0.43],
[5.  , 3.  , 0.  , 0.31, 0.37, 0.64],
[5.  , 4.  , 0.53, 0.4 , 0.52, 0.28],
[5.  , 5.  , 0.26, 0.35, 0.46, 0.53]])
In [ ]:
plt.figure(figsize=(10, 10))
plt.subplot(2, 1, 1)
plt.plot(results[:, 1], results[:, 3:6], 'o-')
plt.xticks(range(1, 6))  # without this, we see 1.5, 2.5, etc.
plt.xlabel('Fold Index')
plt.ylabel('RMSE')
plt.legend(('Train','Validate','Test'),loc='best')

plt.subplot(2, 1, 2)
plt.plot(results[:, 1], results[:, 2], '-o')
plt.xticks(range(1, 6))
plt.xlabel('Fold Index')
plt.ylabel(r'Best $\lambda$');