%pylab inline
Populating the interactive namespace from numpy and matplotlib
This exercise covers cross-validation of regression models on the Diabetes dataset. The diabetes data consists of 10 physiological variables (age, sex, body mass index, blood pressure, and six blood serum measurements) measured on 442 patients, and an indication of disease progression after one year:
from sklearn.datasets import load_diabetes
data = load_diabetes()
X, y = data.data, data.target
print(X.shape)
(442, 10)
print(y.shape)
(442,)
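To see which physiological variables the ten columns correspond to, we can inspect the dataset's feature names (the attribute and the names below are what recent scikit-learn releases provide):
print(data.feature_names)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']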
Here we'll be fitting two regularized linear models: Ridge Regression, which uses $\ell_2$ regularization, and Lasso Regression, which uses $\ell_1$ regularization.
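Concretely, scikit-learn's Ridge minimizes $\|Xw - y\|_2^2 + \alpha \|w\|_2^2$, while Lasso minimizes $\frac{1}{2 n_{\mathrm{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1$; in both cases the hyper-parameter $\alpha$ sets the strength of the regularization.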
from sklearn.linear_model import Ridge, Lasso
We'll first fit each model with its default hyper-parameters to get a baseline, using the mean cross-validation score to measure goodness-of-fit.
from sklearn.model_selection import cross_val_score
for Model in [Ridge, Lasso]:
    model = Model()
    print(Model.__name__, cross_val_score(model, X, y, cv=3).mean())
Ridge 0.409427438303
Lasso 0.353800083299
We see that for the default hyper-parameter values, Ridge outperforms Lasso. But is this still the case for the optimal hyper-parameters of each model?
Here, spend some time writing a function which computes the cross-validation score as a function of alpha, the strength of the regularization, for both Lasso and Ridge. We'll choose 30 values of alpha between 0.001 and 0.1:
alphas = np.logspace(-3, -1, 30)
# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator
# as a function of alpha. Which is more difficult to tune?
%load solutions/06B_basic_grid_search.py
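The solution file isn't reproduced here, but a minimal sketch of one possible approach (reusing cross_val_score and the alphas defined above) looks like this:
# Compute the mean 3-fold cross-validation score for each alpha and
# plot the resulting validation curve for both estimators.
for Model in [Ridge, Lasso]:
    scores = [cross_val_score(Model(alpha=alpha), X, y, cv=3).mean()
              for alpha in alphas]
    plt.plot(alphas, scores, label=Model.__name__)
plt.xlabel('alpha')
plt.ylabel('mean cross-validation score')
plt.legend(loc='best')
Comparing the two curves shows which model's score is more sensitive to the choice of alpha, and hence which is harder to tune.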
Because searching a grid of hyper-parameters is such a common task, scikit-learn provides several hyper-parameter estimators to automate it. We'll explore these in more depth later in the tutorial, but for now it is interesting to see how GridSearchCV works:
from sklearn.model_selection import GridSearchCV
GridSearchCV is constructed with an estimator, as well as a dictionary of parameter values to be searched. We can find the optimal parameters this way:
for Model in [Ridge, Lasso]:
    gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)
    print(Model.__name__, gscv.best_params_)
Ridge {'alpha': 0.062101694189156162}
Lasso {'alpha': 0.01268961003167922}
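Besides best_params_, the fitted GridSearchCV object also exposes the best mean cross-validation score and an estimator refit on the full dataset:
print(gscv.best_score_)      # best mean cross-validation score on the grid
print(gscv.best_estimator_)  # estimator refit on all of X, y with the best alpha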
For some models within scikit-learn, cross-validation can be performed more efficiently on large datasets. In these cases, a cross-validated version of the particular model is included. The cross-validated versions of Ridge and Lasso are RidgeCV and LassoCV, respectively. Grid search on these estimators can be performed as follows:
from sklearn.linear_model import RidgeCV, LassoCV
for Model in [RidgeCV, LassoCV]:
    model = Model(alphas=alphas, cv=3).fit(X, y)
    print(Model.__name__, model.alpha_)
RidgeCV 0.0621016941892
LassoCV 0.00788046281567
We see that RidgeCV recovers exactly the alpha found by GridSearchCV. LassoCV settles on a slightly different value: it picks the alpha that minimizes the mean squared error across folds, which need not coincide with the alpha that maximizes the R² score GridSearchCV uses by default.
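Part of the efficiency comes from specialization: when no cv argument is given, RidgeCV evaluates every alpha with a closed-form leave-one-out cross-validation rather than refitting the model once per fold:
# With cv unset, RidgeCV uses efficient leave-one-out cross-validation.
model = RidgeCV(alphas=alphas).fit(X, y)
print(model.alpha_)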
Here we'll apply learning curves to the diabetes data. You can follow the process used in the previous notebook to plot the learning curves.
A good metric to use is the mean_squared_error, which we'll import below:
from sklearn.metrics import mean_squared_error
# define a function that computes the learning curve (i.e. mean_squared_error as a function
# of training set size, for both training and test sets) and plot the result
%load solutions/06B_learning_curves.py
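The solution file isn't shown here either; a minimal sketch of the idea (the plot_learning_curve name and the fixed alpha are choices made for this sketch) might look like the following:
from sklearn.model_selection import train_test_split

def plot_learning_curve(model, X, y):
    # Hold out a fixed test set, then refit the model on growing subsets
    # of the training set, recording the mean squared error on both the
    # training subset and the held-out test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    sizes = np.linspace(50, len(X_train), 10).astype(int)
    train_error, test_error = [], []
    for size in sizes:
        model.fit(X_train[:size], y_train[:size])
        train_error.append(mean_squared_error(y_train[:size], model.predict(X_train[:size])))
        test_error.append(mean_squared_error(y_test, model.predict(X_test)))
    plt.plot(sizes, train_error, label='training error')
    plt.plot(sizes, test_error, label='test error')
    plt.xlabel('training set size')
    plt.ylabel('mean squared error')
    plt.legend(loc='best')

plot_learning_curve(Ridge(alpha=0.06), X, y)
As the training set grows, the two curves should approach one another; a persistent gap between them signals overfitting, while convergence to a high error signals underfitting.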