We often do not want the coefficients (weights) to be too large, so we append a penalty term to the loss function to discourage large values of $w$.
\begin{align} \mathcal{L} & = \sum_{i=1}^N (y_i-f(x_i|w,b))^2 + \alpha \sum_{j=1}^D w_j^2 + \beta \sum_{j=1}^D |w_j| \end{align}where $f(x_i|w,b) = wx_i+b$. The values of $\alpha$ and $\beta$ are positive (or zero), with higher values pushing the weights closer to zero.
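The penalised loss above can be sketched directly in NumPy. This is a minimal illustration with made-up toy data; the function name is not part of any library.

```python
import numpy as np

def penalised_loss(w, b, x, y, alpha=1.0, beta=1.0):
    """Squared error plus L2 (alpha) and L1 (beta) penalties on w."""
    residuals = y - (x @ w + b)          # y_i - f(x_i | w, b)
    sq_error = np.sum(residuals ** 2)    # sum of squared errors
    l2 = alpha * np.sum(w ** 2)          # ridge-style penalty
    l1 = beta * np.sum(np.abs(w))        # lasso-style penalty
    return sq_error + l2 + l1

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
y = rng.standard_normal((5, 1))
w = np.zeros((3, 1))
# With w = 0 both penalties vanish, so the loss is just the squared error:
print(penalised_loss(w, 0.0, x, y))
```

Note that at $w = 0$ the penalty terms contribute nothing, which is exactly why large $\alpha$ or $\beta$ pulls the optimal weights towards zero.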
We will use models from sklearn.linear_model.
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline
# To reproduce the exact same numbers we set the seed for the random number generator:
np.random.seed(1)
A normally distributed random sample looks as follows:
e = np.random.randn(10000,1)
plt.hist(e,100) #histogram with 100 bins
plt.ylabel('y')
plt.xlabel('x')
plt.title('Histogram of Normally Distributed Numbers')
plt.show()
Generate observations $y$ given feature (design) matrix $X$ according to: $$ y = Xw + \xi\\ \xi_i \sim \mathcal{N}(0,\sigma^2) $$
In this particular case, $w$ is a 100-dimensional vector in which 90% of the entries are zero, i.e. only 10 of the entries are non-zero.
# Generate the data
N = 40 # Number of observations
D = 100 # Dimensionality
x = np.random.randn(N,D) # get random observations of x
w_true = np.zeros((D,1)) # create a weight vector of zeros
idx = np.random.choice(100,10,replace=False) # randomly choose 10 of those weights
w_true[idx] = np.random.randn(10,1) # populate them with 10 random weights
e = np.random.randn(N,1) # have a noise vector
y = np.matmul(x,w_true) + e # generate observations
# create validation set:
N_test = 50
x_test = np.random.randn(N_test,D)
y_test_true = np.matmul(x_test,w_true)
model = LinearRegression()
model.fit(x,y)
# plot the true vs estimated coefficients
plt.plot(np.arange(100),np.squeeze(model.coef_))
plt.plot(np.arange(100),w_true)
plt.legend(["Estimated","True"])
plt.title('Estimated Weights')
plt.show()
One way of testing how good your model is, is to look at metrics. For regression, Mean Squared Error (MSE) is a common metric, defined as: $$ \frac{1}{N}\sum_{i=1}^N \xi_i^2$$ where $\xi_i = y_i-f(x_i|w,b)$. Furthermore, it is best to look at the MSE on a validation set, rather than on the training dataset that we used to train the model.
y_est = model.predict(x_test)
mse = np.mean(np.square(y_test_true-y_est))
print(mse)
6.79499954951
Ridge regression penalises the weights via the $\alpha$ parameter in the loss function at the top (with $\beta = 0$): the larger the squared weights, the higher the loss.
from sklearn.linear_model import Ridge
model = Ridge(alpha=5.0,fit_intercept = False)
# Train the model: same code as was used above
model.fit(x,y)
# plot the true vs estimated coefficients
plt.plot(np.arange(100),np.squeeze(model.coef_))
plt.plot(np.arange(100),w_true)
plt.legend(["Estimated","True"])
plt.show()
This model performs slightly better than the model without any penalty on the weights.
y_est = model.predict(x_test)
mse = np.mean(np.square(y_test_true-y_est))
print(mse)
6.42288072501
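The choice of $\alpha$ affects the result. A simple way to pick it, sketched below on the same kind of synthetic data (this sweep is not part of the original exercise and the grid values are illustrative), is to compare validation MSE across a grid of candidates:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Regenerate synthetic sparse-weight data, as in the section above.
rng = np.random.default_rng(1)
N, D = 40, 100
x = rng.standard_normal((N, D))
w_true = np.zeros((D, 1))
w_true[rng.choice(D, 10, replace=False)] = rng.standard_normal((10, 1))
y = x @ w_true + rng.standard_normal((N, 1))
x_val = rng.standard_normal((50, D))
y_val = x_val @ w_true  # noiseless validation targets

# Sweep alpha and keep the value with the lowest validation MSE.
best_alpha, best_mse = None, np.inf
for alpha in [0.1, 1.0, 5.0, 10.0, 50.0]:
    model = Ridge(alpha=alpha, fit_intercept=False).fit(x, y)
    mse = np.mean((y_val - model.predict(x_val)) ** 2)
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse
print(best_alpha, best_mse)
```

In practice sklearn's `RidgeCV` automates this kind of search with cross-validation.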
Lasso is a model that encourages weights to go to zero exactly, as opposed to Ridge regression which encourages small weights.
from sklearn.linear_model import Lasso
# Set up a Lasso model with `fit_intercept=False` and train it.
# Note: the alpha value below is an example; tune it as needed.
model = Lasso(alpha=0.1, fit_intercept=False)
model.fit(x,y)
# plot the true vs estimated coefficients
plt.plot(np.arange(100),np.squeeze(model.coef_))
plt.plot(np.arange(100),w_true)
plt.legend(["Estimated","True"])
plt.title('Lasso regression weight inference')
plt.show()
The MSE is significantly better than that of both models above.
y_est = model.predict(x_test)[:,None]
mse = np.mean(np.square(y_test_true-y_est))
print(mse)
2.30600134084
Automatic Relevance Determination (ARD) regression is similar to the lasso in that it encourages zero weights. However, its advantage is that you do not need to set a penalisation parameter such as $\alpha$ or $\beta$ for this model.
from sklearn.linear_model import ARDRegression
# Set up an ARDRegression model with `fit_intercept=False` and train it.
# This model is not sensitive to its hyperparameters, which makes it powerful.
model = ARDRegression(fit_intercept=False)
model.fit(x, y.ravel())  # ravel() avoids the column-vector warning
# plot the true vs estimated coefficients
plt.plot(np.arange(100),np.squeeze(model.coef_))
plt.plot(np.arange(100),w_true)
plt.legend(["Estimated","True"])
plt.show()
y_est = model.predict(x_test)[:,None]
mse = np.mean(np.square(y_test_true-y_est))
print(mse)
2.87016741296
Rerun the above, setting N = 400.
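One convenient way to do this rerun is to wrap the data generation and model fitting in a function parameterised by N. This is a sketch: the helper name `run_experiment` and the alpha values are illustrative, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

def run_experiment(N, D=100, seed=1):
    """Generate sparse-weight data and return validation MSE per model."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((N, D))
    w_true = np.zeros((D, 1))
    w_true[rng.choice(D, 10, replace=False)] = rng.standard_normal((10, 1))
    y = x @ w_true + rng.standard_normal((N, 1))
    x_test = rng.standard_normal((50, D))
    y_test = x_test @ w_true  # noiseless validation targets

    results = {}
    for name, model in [("ols", LinearRegression()),
                        ("ridge", Ridge(alpha=5.0, fit_intercept=False)),
                        ("lasso", Lasso(alpha=0.1, fit_intercept=False,
                                        max_iter=10000))]:
        model.fit(x, y.ravel())
        pred = model.predict(x_test).reshape(-1, 1)
        results[name] = float(np.mean((y_test - pred) ** 2))
    return results

print(run_experiment(40))
print(run_experiment(400))  # with N > D the problem is no longer underdetermined
```

With N = 400 there are more observations than dimensions, so even the unpenalised model has enough data to pin down the weights.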
The following section is optional and you may skip it. It is not necessary for understanding Deep Learning.
Inverse problems are those where, given the outputs, you are required to infer the inputs. A typical example is X-ray imaging: given the X-ray sensor readings, the algorithm needs to build an image of an individual's bone structure.
See compressed_sensing for an example of l1 regularisation applied to a compressed sensing problem (which resembles the X-ray problem).
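As a toy illustration of the same idea (a sketch, not the linked example): l1 regularisation can recover a sparse signal from fewer measurements than unknowns. The dimensions and alpha below are made-up example values.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy inverse problem: 30 measurements of a 100-dimensional, 5-sparse signal.
rng = np.random.default_rng(0)
n_meas, dim = 30, 100
A = rng.standard_normal((n_meas, dim))  # measurement (sensing) matrix
w_true = np.zeros(dim)
w_true[rng.choice(dim, 5, replace=False)] = rng.standard_normal(5)
y = A @ w_true  # noiseless measurements

# The l1 penalty lets us find a sparse estimate despite n_meas < dim.
model = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(A, y)
n_recovered = np.count_nonzero(np.abs(model.coef_) > 1e-3)
print(n_recovered)  # number of recovered non-zero coefficients
```

Ordinary least squares has infinitely many exact solutions here; the l1 penalty selects a sparse one, which is the core trick behind compressed sensing.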