$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

Linear Regression with Fixed Nonlinear Features

The models we have been building are linear in the parameters $\wv$ and linear in the attributes (features) of the samples. We can make models that are nonlinear in the attributes by adding nonlinear functions of the original features as new columns of the data matrix.

Say we have a single feature for each sample. Our data matrix is $$ \begin{alignat*}{1} X &= \begin{bmatrix} x_0\\ x_1\\ \vdots \\ x_N \end{bmatrix} \end{alignat*} $$ We can add other powers of each $x$ value, say up to the fourth power. $$ \begin{alignat*}{1} X &= \begin{bmatrix} x_0 & x_0^2 & x_0^3 & x_0^4\\ x_1 & x_1^2 & x_1^3 & x_1^4\\ \vdots \\ x_N & x_N^2 & x_N^3 & x_N^4\\ \end{bmatrix} \end{alignat*} $$

This is simple to do in Python:

X = np.hstack((X, X**2, X**3, X**4))

Which of these powers of $x$ are useful? Looking at the magnitudes of the weights is helpful, as long as the input features are standardized. We can do more than this, though. If we build multiple models from multiple bootstrap samples of the training data, we can compute confidence intervals for the weight values. If zero is not included in the interval between a weight's lower and upper 90% confidence limits, then we can say that we are 90% certain that the value of this weight is not zero. If the interval does include zero, the corresponding feature is probably not useful.

Here is some code that illustrates this whole process, including the use of the penalty on weight magnitudes controlled by $\lambda$.
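Concretely, writing $\Xv_s$ for the standardized inputs with a leading column of 1s, and $\tilde{\mathbf{I}}$ for an identity matrix whose first diagonal entry is set to zero so that the bias weight is not penalized, the train function below finds the weights that satisfy $$ \left( \Xv_s^T \Xv_s + \lambda \tilde{\mathbf{I}} \right) \wv = \Xv_s^T \Tv. $$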

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
import random
In [ ]:
def train(X, T, lamb=0):
    # Standardize the inputs, prepend a column of 1s for the bias weight,
    # then solve the regularized least squares problem for the weights.
    means = X.mean(0)
    stds = X.std(0)
    n, d = X.shape
    Xs1 = np.insert((X - means)/stds, 0, 1, axis=1)
    lambDiag = np.eye(d+1) * lamb
    lambDiag[0, 0] = 0   # do not penalize the bias weight
    w = np.linalg.lstsq( Xs1.T @ Xs1 + lambDiag, Xs1.T @ T, rcond=None)[0]
    return {'w': w, 'means':means, 'stds':stds, 'lambda': lamb}

def use(model, X):
    # Standardize with the training means and stds, then apply the weights.
    Xs1 = np.insert((X - model['means'])/model['stds'], 0, 1, axis=1)
    return Xs1 @ model['w']

def rmse(A,B):
    # Root-mean-square error between two arrays of the same shape.
    return np.sqrt(np.mean( (A - B)**2 ))

Now, let's make a simple function of $x$. How about $f(x) = -1 + 0.1 x^2 - 0.02 x^3 + 0.5 n$, where $n$ is drawn from a standard Normal distribution?

In [ ]:
nSamples = 40
trainingFraction = 0.5
nModels = 1000
confidence = 90 # percent
lambdaw = 0.0 # penalty on weight magnitudes

X = np.hstack((np.linspace(0, 3, num=nSamples),
               np.linspace(6, 10, num=nSamples))).reshape((2*nSamples, 1))
# T = -1 + 1 * X + 2 * np.sin(X*2) + 0.55*np.random.normal(size=(2*nSamples,1))
T = -1 + 0 * X + 0.1 * X**2 - 0.02 * X**3 + 0.5*np.random.normal(size=(2*nSamples, 1))
X.shape, T.shape
In [ ]:
plt.plot(X, T, '.-');

Let's add powers of each feature, up to the fifth power.

In [ ]:
Xf = np.hstack((X, X**2, X**3, X**4, X**5))
#Xf = np.hstack((X, np.sin(X)))

Divide data into training and testing sets, randomly.

In [ ]:
round(7.8)   # quick check of round before using it to size the training set
In [ ]:
nRows = Xf.shape[0]
rowIndices = np.arange(nRows)
np.random.shuffle(rowIndices)
nTrain = round(nRows * trainingFraction)
nTest = nRows - nTrain
trainI = rowIndices[:nTrain]
testI = rowIndices[nTrain:]

Xtrain = Xf[trainI, :]
Ttrain = T[trainI, :]
Xtest = Xf[testI, :]
Ttest = T[testI, :]

Xtrain.shape, Ttrain.shape, Xtest.shape, Ttest.shape
In [ ]:
plt.plot(Xtrain[:, 0], Ttrain, 'o', label='Train')
plt.plot(Xtest[:, 0], Ttest, 'ro', label='Test')
plt.legend(loc='best');

Make models based on bootstrap samples of the training data. The list models will hold one model for each bootstrap sample.

In [ ]:
[random.randint(0,10) for i in range(20)]   # quick check of random.randint before using it to draw bootstrap samples
In [ ]:
nModels = 100
models = []
for modeli in range(nModels):
    # Draw random sample (row) numbers, with repetition
    trainI = [random.randint(0, nTrain-1) for i in range(nTrain)]
    XtrainBoot = Xtrain[trainI, :]
    TtrainBoot = Ttrain[trainI, :]
    model = train(XtrainBoot, TtrainBoot, lambdaw)
    models.append(model)
In [ ]:
len(models)
In [ ]:
models[0]
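
The confidence variable was set to 90 earlier but has not been used yet. As a minimal sketch (not part of the original cells), here is one way to turn the bootstrap weights into percentile confidence intervals and flag the weights whose intervals exclude zero, following the reasoning above.

In [ ]:
# Sketch: percentile confidence intervals on the weights across the bootstrap models.
W = np.array([m['w'].flatten() for m in models])   # nModels x (1 + number of feature columns)
lowPerc = (100 - confidence) / 2                   # 5
highPerc = 100 - lowPerc                           # 95
lows, highs = np.percentile(W, [lowPerc, highPerc], axis=0)
for i, (lo, hi) in enumerate(zip(lows, highs)):
    name = 'bias' if i == 0 else 'x^{}'.format(i)
    note = '' if lo <= 0 <= hi else '  zero excluded'
    print('{:>5s}: [{:8.3f}, {:8.3f}]{}'.format(name, lo, hi, note))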

Now we will apply all of the models to the test data.

In [ ]:
YAll = []
for model in models:
    YAll.append( use(model, Xtest) )
In [ ]:
np.array(YAll).shape
In [ ]:
np.array(YAll).squeeze().shape
In [ ]:
YAll = np.array(YAll).squeeze().T                # now nTest x nModels
Ytest = np.mean(YAll, axis=1).reshape(-1, 1)     # average over models, kept as a column to match Ttest
In [ ]:
Ytest.shape
In [ ]:
RMSEtest = np.sqrt(np.mean((Ytest - Ttest)**2))
print('Test RMSE is {:.4f}'.format(RMSEtest))
In [ ]:
nPlot = 100
Xplot = np.linspace(0, 12.5, nPlot).reshape((nPlot ,1))
Xplotf = np.hstack((Xplot, Xplot**2, Xplot**3, Xplot**4, Xplot**5))
Ys = []
for model in models:
    Yplot = use(model, Xplotf)
    Ys.append( Yplot.flatten())
Ys = np.array(Ys).T

plt.figure(figsize=(10, 10))
plt.plot(Xtrain[:, 0], Ttrain, 'o')
plt.plot(Xtest[:, 0], Ttest, 'o')
plt.plot(Xplotf[:, 0], Ys, alpha=0.2);
plt.ylim(-14,2);
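
The spread of these curves shows the bootstrap uncertainty directly. As another sketch using the confidence value from above (again, not part of the original cells), we could summarize the spread as a 90% percentile band around the mean prediction:

In [ ]:
# Sketch: 90% percentile band over the bootstrap predictions in Ys (nPlot x nModels).
lowPerc = (100 - confidence) / 2
highPerc = 100 - lowPerc
lowCurve, highCurve = np.percentile(Ys, [lowPerc, highPerc], axis=1)

plt.figure(figsize=(10, 10))
plt.plot(Xtrain[:, 0], Ttrain, 'o')
plt.plot(Xtest[:, 0], Ttest, 'o')
plt.fill_between(Xplot[:, 0], lowCurve, highCurve, alpha=0.3)
plt.plot(Xplot[:, 0], Ys.mean(axis=1))
plt.ylim(-14, 2);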

Now let's try some other data. Here is some data related to the design of hulls on yachts.

In [ ]:
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/00243/yacht_hydrodynamics.data
!head yacht_hydrodynamics.data
In [ ]:
data = np.loadtxt('yacht_hydrodynamics.data')

T = data[:, -1:]
X = data[:, :-1]
Xnames = ['Center of Buoyancy', 'Prismatic coefficient', 'Length-displacement ratio', 'Beam-draught ratio',
          'Length-beam ratio', 'Froude number']
Tname = 'Resistance'
X.shape, T.shape, Xnames, Tname
In [ ]:
plt.figure(figsize=(10, 10))
for i in range(6):
    plt.subplot(2, 3, i+1)
    plt.plot(X[:, i] ,T, '.')
    plt.ylabel(Tname)
    plt.xlabel(Xnames[i])
plt.tight_layout()
In [ ]:
plt.plot(X[:100, :])
plt.plot(T[:100, :])
In [ ]:
model = train(X, T)
predict = use(model, X)
print(rmse(predict, T))
In [ ]:
plt.plot(T)
plt.plot(predict)
In [ ]:
plt.plot(T, predict, 'o')
plt.plot([0, 50], [0, 50],  'r-')
plt.xlabel('actual')
plt.ylabel('predicted')

Hmm... that last variable, the Froude number, looks like raising it to a higher power might make it more linearly related to resistance.

In [ ]:
plt.plot(X[:,-1], T, 'o')
In [ ]:
plt.plot(X[:,-1]**2, T, 'o')
In [ ]:
plt.plot(X[:,-1]**4, T, 'o')
In [ ]:
plt.plot(X[:,-1]**8, T, 'o')
In [ ]:
Xf = np.hstack([X, X[:,-1:]**8])
In [ ]:
model = train(Xf, T)
predict = use(model, Xf)
print(rmse(predict, T))
In [ ]:
plt.plot(T)
plt.plot(predict)
In [ ]:
plt.plot(T, predict, 'o')
plt.plot([0, 50], [0, 50])
In [ ]:
n = 50
plt.plot(T[:n])
plt.plot(predict[:n])

Maybe higher powers would work better.

In [ ]:
result = []
for deg in range(1, 20):
    Xf = X # .copy()
    for d in range(2, deg+1):
        Xf = np.hstack((Xf, X**d))
    err = rmse( use( train(Xf, T), Xf), T)
    #print(deg, Xf.shape)
    result.append([deg,err])
result = np.array(result)
result
In [ ]:
plt.plot(result[:,0],result[:,1],'o-')
plt.xlabel('Exponent of X')
plt.ylabel('RMSE')
In [ ]:
Xf = np.hstack([X, X**2, X**3, X**4, X**5, X**6])
predict = use(train(Xf, T), Xf)

plt.plot(T)
plt.plot(predict)
In [ ]:
plt.plot(T, predict, 'o')
plt.plot([0, 50], [0, 50])
plt.xlabel('Actual')
plt.ylabel('Predicted');