Lasso assignment¶

Yoav Ram¶

First, we load the diabetes dataset and try to predict the progression of the disease from ten features: age, sex, body mass index, blood pressure, and six blood serum measurements.

In [10]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(
    style='ticks',
    context='talk',
    palette='Set1'
)

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import warnings
warnings.simplefilter('ignore', FutureWarning)

Let's load the data and plot it.

In [2]:

dataset = load_diabetes()
feature_names = dataset.feature_names
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target
X.head()

Out[2]:

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019908	-0.017646
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068330	-0.092204
2	0.085299	0.050680	0.044451	-0.005671	-0.045599	-0.034194	-0.032356	-0.002592	0.002864	-0.025930
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022692	-0.009362
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031991	-0.046641

In [3]:

sns.pairplot(X);

In [4]:

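# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
sns.histplot(y, kde=True);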

We can now split the data.

In [12]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

Linear regression¶

Here we perform standard linear regression.

Note that the score (the coefficient of determination, $R^2$, on the test set) can differ between runs, because the train-test split is random!

In [13]:

reg = LinearRegression()
reg.fit(X_train, y_train)
print("Score:", reg.score(X_test, y_test))

Score: 0.5041314648870356
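To get a feeling for how much the score depends on the split, the following minimal sketch refits the same linear regression on several different splits. The seeds (0 to 9) and the `_demo` variable names are arbitrary choices made for this illustration and are not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Separate names so that the notebook's X and y are not overwritten
X_demo, y_demo = load_diabetes(return_X_y=True)

scores = []
for seed in range(10):  # arbitrary seeds, for illustration only
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))  # R^2 on the held-out data
scores = np.array(scores)

print(f"R^2 over {len(scores)} splits: "
      f"min {scores.min():.2f}, mean {scores.mean():.2f}, max {scores.max():.2f}")
```

Passing a fixed `random_state` to `train_test_split` makes a single split reproducible, if you prefer to get the same score every time you run the notebook.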

Lasso regression¶

Now experiment with the Lasso model from the sklearn.linear_model module (Lasso: least absolute shrinkage and selection operator).

Lasso is a linear model that performs regularization, that is, it avoids giving too much weight to any single feature. This is done by minimizing not just the (scaled) sum of squared residuals between the model predictions and the observed values (the fit), but also the sum of the absolute values of the model coefficients (the penalty):

$$ \begin{aligned} \hat{y} &= a_0 + a_1 x_1 + \ldots + a_m x_m \\ f(a_0, a_1, \ldots, a_m) &= \frac{1}{2n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} + \alpha \sum_{j=1}^{m} |a_j| \end{aligned} $$

This is useful for feature selection, that is, finding features that are less important and therefore get a zero coefficient. Thus, it can improve the prediction accuracy as well as the interpretability of the linear model.

A meta-parameter $\alpha$ is used to weight the penalty relative to the fit: the higher $\alpha$, the more weight is given to the penalty.
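The effect of $\alpha$ on feature selection can be seen by counting the non-zero coefficients for a few values of $\alpha$. This is only a sketch: the grid of values and the `max_iter` setting are assumptions made for this illustration, and the fit uses the full dataset rather than the training split.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

# Separate names so that the notebook's X and y are not overwritten
X_demo, y_demo = load_diabetes(return_X_y=True)

alphas_demo = [0.001, 0.01, 0.1, 1.0, 10.0]  # arbitrary grid, for illustration
n_nonzero = []
for alpha in alphas_demo:
    # max_iter is raised so that small alphas have a better chance to converge
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo)
    n_nonzero.append(int(np.sum(model.coef_ != 0)))
    print(f"alpha={alpha:<6} non-zero coefficients: {n_nonzero[-1]}")
```

When $\alpha$ is large enough, the penalty dominates and all the coefficients are set to zero, so the model predicts the mean of the target; when $\alpha$ is close to zero, the model approaches standard linear regression.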

Ex 1¶

Perform the analysis with a Lasso model.

Add plots to analyse the regressor. Here I created a prediction plot on the left, which compares the predicted and real values of the test set; and on the right a bar plot of the feature importance, measured by the coefficients.

Note that scores can differ between runs, because the train-test split is random!

Reminder

Edit cell by double clicking
Run cell by pressing Shift+Enter
Get autocompletion by pressing Tab
Get documentation by pressing Shift+Tab

In [14]:

### your code here

Lasso model score: 0.49
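One possible way to approach this exercise is sketched below. It is not the reference solution: `alpha=0.1` and `random_state=0` are arbitrary choices, the variable names are made up for this illustration, and the score you get will generally differ from the one shown above.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_ex = pd.DataFrame(data.data, columns=data.feature_names)
y_ex = data.target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_ex, y_ex, test_size=0.25, random_state=0  # fixed seed, for reproducibility
)

lasso = Lasso(alpha=0.1)  # alpha picked by hand here, not tuned
lasso.fit(X_tr, y_tr)
y_pred = lasso.predict(X_te)
score = lasso.score(X_te, y_te)
print(f"Lasso model score: {score:.2f}")

fig, (ax_pred, ax_coef) = plt.subplots(1, 2, figsize=(12, 5))

# Left: predicted vs. observed values; points on the dashed line are perfect predictions
ax_pred.scatter(y_te, y_pred, alpha=0.6)
lims = [y_ex.min(), y_ex.max()]
ax_pred.plot(lims, lims, color='k', linestyle='--')
ax_pred.set(xlabel='Observed', ylabel='Predicted')

# Right: model coefficients as a rough measure of feature importance
coefs = pd.Series(lasso.coef_, index=X_ex.columns)
coefs.plot.bar(ax=ax_coef)
ax_coef.set(xlabel='Feature', ylabel='Coefficient')

fig.tight_layout()
```

Features whose bar is missing in the right panel got a zero coefficient, that is, the model dropped them.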

Ex 2¶

Bonus: to avoid choosing your own $\alpha$, try the LassoCV model (from scikit-learn) with the AlphaSelection visualization (from the Yellowbrick library).

If you need to install Yellowbrick, run the following command in a separate cell: %pip install yellowbrick (this installs the package into the Python environment that the notebook is running in).

In [15]:

### your code here

LassoCV α: 0.077
LassoCV model score: 0.50
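A minimal sketch of the LassoCV part is given below. It covers only the cross-validated choice of $\alpha$ and leaves out the Yellowbrick visualization; `cv=5`, `max_iter=10_000` and `random_state=0` are assumptions made for this illustration, so the selected $\alpha$ and the score will generally differ from the ones shown above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Separate names so that the notebook's X and y are not overwritten
X_cv, y_cv = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cv, y_cv, test_size=0.25, random_state=0  # fixed seed, for reproducibility
)

# LassoCV builds its own grid of alphas and picks the one with the
# lowest cross-validated error on the training data
lasso_cv = LassoCV(cv=5, max_iter=10_000)
lasso_cv.fit(X_tr, y_tr)

print(f"LassoCV α: {lasso_cv.alpha_:.3f}")
print(f"LassoCV model score: {lasso_cv.score(X_te, y_te):.2f}")
```

The test set is not used for choosing $\alpha$, only for the final score, so the score remains an estimate of the performance on unseen data.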