%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(
    style='ticks',
    context='talk',
    palette='Set1'
)
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter('ignore', FutureWarning)
Let's load the data and plot it.
dataset = load_diabetes()
feature_names = dataset.feature_names
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target
X.head()
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 |
2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 |
3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |
sns.pairplot(X);
sns.histplot(y, kde=True);
We can now split the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Here we perform standard linear regression.
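Note that scores can differ between runs because the train-test split is random! Pass random_state to train_test_split if you need reproducible results.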
reg = LinearRegression()
reg.fit(X_train, y_train)
print("Score:", reg.score(X_test, y_test))
Score: 0.5041314648870356
Now experiment with the Lasso model from the linear_model module of scikit-learn (Lasso: least absolute shrinkage and selection operator).
Lasso is a linear model that performs regularization, that is, it avoids giving too much weight to any single feature. This is done by minimizing not just the sum of squared residuals between the model predictions and the observed values (the fit), but also the sum of the absolute values of the model coefficients (the penalty):
$$ \hat{y} = a_0 + a_1 x_1 + \ldots + a_m x_m $$
$$ f(a_1, \ldots, a_m) = \frac{1}{2n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} + \alpha \sum_{j=1}^{m} |a_j| $$
This is useful for feature selection, that is, finding features that are less important and therefore get a zero coefficient. Thus, it can improve prediction accuracy as well as the interpretability of the linear model.
A meta-parameter $\alpha$ is used to weight the penalty relative to the fit: the higher $\alpha$, the more weight is given to the penalty.
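To make the two terms concrete, here is a minimal NumPy sketch of the objective $f$ above. The function name lasso_objective and the toy arrays are made up for illustration; they are not part of scikit-learn or of this notebook's data.

```python
import numpy as np

def lasso_objective(X, y, a0, a, alpha):
    """Evaluate the Lasso objective: fit term plus alpha times the penalty."""
    n = len(y)
    y_hat = a0 + X @ a                         # linear prediction
    fit = np.sum((y_hat - y) ** 2) / (2 * n)   # squared residuals, scaled by 1/(2n)
    penalty = np.sum(np.abs(a))                # L1 norm; the intercept a0 is not penalized
    return fit + alpha * penalty

# Toy data, invented for this example only
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([1.0, 2.0, 3.0])
a_toy = np.array([1.0, 2.0])  # these coefficients fit the toy data exactly

f_small = lasso_objective(X_toy, y_toy, 0.0, a_toy, alpha=0.1)
f_large = lasso_objective(X_toy, y_toy, 0.0, a_toy, alpha=1.0)
print(f_small, f_large)
```

Because the toy coefficients fit the data exactly, the fit term is zero and the whole objective comes from the penalty, so raising $\alpha$ from 0.1 to 1.0 raises $f$ tenfold for the same coefficients.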
Perform the analysis with a Lasso model.
Add plots to analyse the regressor. Here I created a prediction plot on the left, which compares the predicted and real values of the test set; and on the right a bar plot of the feature importance, measured by the coefficients.
Note that scores can differ between runs because the train-test split is random!
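One possible way to approach the analysis described above is sketched here. It is only a starting point, not the reference solution: alpha=0.1, random_state=0, the figure size, and the variable names are all assumptions chosen for the sketch, so the score will not necessarily match the output shown in this notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

dataset = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25,
    random_state=0,  # fixed seed is an assumption, used only for reproducibility
)

lasso = Lasso(alpha=0.1)  # alpha picked arbitrarily for this sketch
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
print("Lasso model score: {:.2f}".format(lasso.score(X_test, y_test)))

fig, (ax_pred, ax_coef) = plt.subplots(1, 2, figsize=(14, 5))

# Left: predicted vs. observed values on the test set
ax_pred.scatter(y_test, y_pred, alpha=0.6)
lims = [y_test.min(), y_test.max()]
ax_pred.plot(lims, lims, color='k', ls='--')  # diagonal marks a perfect prediction
ax_pred.set_xlabel('Observed')
ax_pred.set_ylabel('Predicted')

# Right: coefficients as a rough measure of feature importance
ax_coef.bar(dataset.feature_names, lasso.coef_)
ax_coef.set_ylabel('Coefficient')
fig.tight_layout()
```

Features whose bar has zero height were dropped by the penalty; try a few values of alpha to see how the set of surviving features changes.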
Reminder
### your code here
Lasso model score: 0.49
Bonus:
To avoid choosing your own $\alpha$, try the LassoCV model (from scikit-learn) with the AlphaSelection visualization (from the Yellowbrick library).
If you need to install Yellowbrick, run the following command in a separate cell: !python3.7 -m pip install yellowbrick
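As a starting point for the scikit-learn half of the bonus, here is a minimal sketch that lets LassoCV pick $\alpha$ by cross-validation. The choices cv=5 and random_state=0 are assumptions made for the sketch, and the Yellowbrick AlphaSelection visualization is deliberately left out, so this covers only part of the exercise.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

dataset = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25,
    random_state=0,  # fixed seed is an assumption, used only for reproducibility
)

# LassoCV fits the model along a grid of alphas and keeps the one
# with the lowest cross-validated error on the training data.
lasso_cv = LassoCV(cv=5, random_state=0)
lasso_cv.fit(X_train, y_train)

print("LassoCV α: {:.3f}".format(lasso_cv.alpha_))
print("LassoCV model score: {:.2f}".format(lasso_cv.score(X_test, y_test)))
```

The selected value is stored in lasso_cv.alpha_, and the cross-validated errors for every candidate are available in lasso_cv.mse_path_ if you want to plot them yourself.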
### your code here
LassoCV α: 0.077
LassoCV model score: 0.50