mlcourse.ai - Open Machine Learning Course

Author: Yury Kashnitsky. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.

Assignment #6 (demo). Solution

Exploring OLS, Lasso and Random Forest in a regression task

Same assignment as a Kaggle Kernel + solution.

Fill in the missing code and choose answers in this web form.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, LassoCV, Lasso
from sklearn.ensemble import RandomForestRegressor

We are working with the UCI Wine Quality dataset (no need to download it – it's already in the course repo and available as a Kaggle Dataset).

In [2]:
data = pd.read_csv('../../data/winequality-white.csv', sep=';')
In [3]:
data.head()
Out[3]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB

Separate the target feature, split the data in a 7:3 proportion (30% forms the holdout set, use random_state=17), and preprocess the data with StandardScaler.

In [5]:
y = data['quality']
X = data.drop('quality', axis=1)

X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, 
                                                          random_state=17)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_holdout_scaled = scaler.transform(X_holdout)

Linear regression

Train a simple linear regression model (Ordinary Least Squares).

In [6]:
linreg = LinearRegression()
linreg.fit(X_train_scaled, y_train);

Question 1: What are mean squared errors of model predictions on train and holdout sets?

In [7]:
print("Mean squared error (train): %.3f" % mean_squared_error(y_train, linreg.predict(X_train_scaled)))
print("Mean squared error (test): %.3f" % mean_squared_error(y_holdout, linreg.predict(X_holdout_scaled)))
Mean squared error (train): 0.558
Mean squared error (test): 0.584

Sort the features by their influence on the target (wine quality). Note that both large positive and large negative coefficients indicate a strong influence on the target. It's handy to use pandas.DataFrame here.

Question 2: Which feature does this linear regression model treat as the most influential on wine quality?

In [8]:
linreg_coef = pd.DataFrame({'coef': linreg.coef_, 'coef_abs': np.abs(linreg.coef_)},
                          index=data.columns.drop('quality'))
linreg_coef.sort_values(by='coef_abs', ascending=False)
Out[8]:
coef coef_abs
density -0.665720 0.665720
residual sugar 0.538164 0.538164
volatile acidity -0.192260 0.192260
pH 0.150036 0.150036
alcohol 0.129533 0.129533
fixed acidity 0.097822 0.097822
sulphates 0.062053 0.062053
free sulfur dioxide 0.042180 0.042180
total sulfur dioxide 0.014304 0.014304
chlorides 0.008127 0.008127
citric acid -0.000183 0.000183

Lasso regression

Train a LASSO model with $\alpha = 0.01$ (weak regularization) and scaled data. Again, set random_state=17.

In [9]:
lasso1 = Lasso(alpha=0.01, random_state=17)
lasso1.fit(X_train_scaled, y_train)
Out[9]:
Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=17,
   selection='cyclic', tol=0.0001, warm_start=False)

Which feature is the least informative in predicting wine quality, according to this LASSO model?

In [10]:
lasso1_coef = pd.DataFrame({'coef': lasso1.coef_, 'coef_abs': np.abs(lasso1.coef_)},
                          index=data.columns.drop('quality'))
lasso1_coef.sort_values(by='coef_abs', ascending=False)
Out[10]:
coef coef_abs
alcohol 0.322425 0.322425
residual sugar 0.256363 0.256363
density -0.235492 0.235492
volatile acidity -0.188479 0.188479
pH 0.067277 0.067277
free sulfur dioxide 0.043088 0.043088
sulphates 0.029722 0.029722
chlorides -0.002747 0.002747
fixed acidity -0.000000 0.000000
citric acid -0.000000 0.000000
total sulfur dioxide -0.000000 0.000000

Train LassoCV with random_state=17 to choose the best value of $\alpha$ in 5-fold cross-validation.

In [11]:
alphas = np.logspace(-6, 2, 200)
lasso_cv = LassoCV(random_state=17, cv=5, alphas=alphas)
lasso_cv.fit(X_train_scaled, y_train)
Out[11]:
LassoCV(alphas=array([1.00000e-06, 1.09699e-06, ..., 9.11589e+01, 1.00000e+02]),
    copy_X=True, cv=5, eps=0.001, fit_intercept=True, max_iter=1000,
    n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=17, selection='cyclic', tol=0.0001,
    verbose=False)
In [12]:
lasso_cv.alpha_
Out[12]:
0.0002833096101839324
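The chosen $\alpha$ can be sanity-checked against the full regularization path: LassoCV stores the per-fold CV error for every candidate alpha in `mse_path_`, and `alpha_` is simply the candidate minimizing the mean CV error. A minimal self-contained sketch on synthetic data (`make_regression` stands in for the scaled wine features; it is an illustration, not the assignment data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled wine features (illustration only)
X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=17)
X = StandardScaler().fit_transform(X)

alphas = np.logspace(-6, 2, 200)
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=17).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds); alphas_ is sorted in decreasing order
mean_mse = lasso_cv.mse_path_.mean(axis=1)
best_idx = np.argmin(mean_mse)

# alpha_ is exactly the candidate that minimizes the mean CV error
assert np.isclose(lasso_cv.alphas_[best_idx], lasso_cv.alpha_)
```

Plotting `mean_mse` against `lasso_cv.alphas_` is also a quick way to see how flat the CV curve is around the chosen alpha.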

Question 3: Which feature is the least informative in predicting wine quality, according to the tuned LASSO model?

In [13]:
lasso_cv_coef = pd.DataFrame({'coef': lasso_cv.coef_, 'coef_abs': np.abs(lasso_cv.coef_)},
                          index=data.columns.drop('quality'))
lasso_cv_coef.sort_values(by='coef_abs', ascending=False)
Out[13]:
coef coef_abs
density -0.648161 0.648161
residual sugar 0.526883 0.526883
volatile acidity -0.192049 0.192049
pH 0.146549 0.146549
alcohol 0.137115 0.137115
fixed acidity 0.093295 0.093295
sulphates 0.060939 0.060939
free sulfur dioxide 0.042698 0.042698
total sulfur dioxide 0.012969 0.012969
chlorides 0.006933 0.006933
citric acid -0.000000 0.000000

Question 4: What are mean squared errors of tuned LASSO predictions on train and holdout sets?

In [14]:
print("Mean squared error (train): %.3f" % mean_squared_error(y_train, lasso_cv.predict(X_train_scaled)))
print("Mean squared error (test): %.3f" % mean_squared_error(y_holdout, lasso_cv.predict(X_holdout_scaled)))
Mean squared error (train): 0.558
Mean squared error (test): 0.583

Random Forest

Train a Random Forest with out-of-the-box parameters, setting only random_state=17.

In [15]:
forest = RandomForestRegressor(random_state=17)
forest.fit(X_train_scaled, y_train)
Out[15]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=17, verbose=0, warm_start=False)

Question 5: What are mean squared errors of RF model on the training set, in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left with default values) and on holdout set?

In [16]:
print("Mean squared error (train): %.3f" % mean_squared_error(y_train, forest.predict(X_train_scaled)))
print("Mean squared error (cv): %.3f" % np.mean(np.abs(cross_val_score(forest, X_train_scaled, y_train, 
                                                                       scoring='neg_mean_squared_error'))))
print("Mean squared error (test): %.3f" % mean_squared_error(y_holdout, forest.predict(X_holdout_scaled)))
Mean squared error (train): 0.075
Mean squared error (cv): 0.460
Mean squared error (test): 0.421
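As a side note, unlike the linear models above, the Random Forest does not actually need the StandardScaler step: tree splits depend only on the ordering of feature values, which standardization preserves, so scaled and unscaled data yield the same forest. A minimal sketch on synthetic data (`make_regression` is a stand-in; scaling is kept in the notebook only for consistency with the linear models):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (illustration only)
X, y = make_regression(n_samples=300, n_features=11, noise=5.0, random_state=17)
X_scaled = StandardScaler().fit_transform(X)

rf_raw = RandomForestRegressor(n_estimators=20, random_state=17).fit(X, y)
rf_scaled = RandomForestRegressor(n_estimators=20, random_state=17).fit(X_scaled, y)

# Standardization is strictly monotonic, so every tree finds the same
# partitions of the data and the two forests predict identically
assert np.allclose(rf_raw.predict(X), rf_scaled.predict(X_scaled))
```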

Tune the max_features and max_depth hyperparameters with GridSearchCV and again check mean cross-validation MSE and MSE on holdout set.

In [17]:
forest_params = {'max_depth': list(range(10, 25)), 
                  'max_features': list(range(6,12))}

locally_best_forest = GridSearchCV(RandomForestRegressor(n_jobs=-1, random_state=17), 
                                 forest_params, 
                                 scoring='neg_mean_squared_error',  
                                 n_jobs=-1, cv=5,
                                  verbose=True)
locally_best_forest.fit(X_train_scaled, y_train)
Fitting 5 folds for each of 90 candidates, totalling 450 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    7.7s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed:   18.8s finished
Out[17]:
GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=17, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_depth': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24], 'max_features': [6, 7, 8, 9, 10, 11]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=True)
In [18]:
locally_best_forest.best_params_, locally_best_forest.best_score_
Out[18]:
({'max_depth': 19, 'max_features': 7}, -0.4346879383644381)
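Note the sign convention here: scikit-learn always maximizes scores, so with scoring='neg_mean_squared_error' the reported best_score_ is a *negative* MSE and must be negated to read it as an error (≈ 0.435 above). A minimal sketch on synthetic data (`make_regression` and the tiny grid are stand-ins for the wine data and the real parameter grid):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the wine data (illustration only)
X, y = make_regression(n_samples=200, n_features=11, noise=5.0, random_state=17)

grid = GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=17),
                    {'max_depth': [3, 5], 'max_features': [3, 6]},
                    scoring='neg_mean_squared_error', cv=3)
grid.fit(X, y)

# best_score_ is a negative MSE (sklearn maximizes scores); flip the sign
best_mse = -grid.best_score_
assert best_mse > 0

# best_score_ is the maximum of the grid's mean test scores in cv_results_
assert np.isclose(grid.best_score_, grid.cv_results_['mean_test_score'].max())
```

Inspecting `grid.cv_results_` is also a good habit: it reveals whether the best parameters sit on the edge of the grid (here max_depth=19 and max_features=7 are comfortably inside the searched ranges).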

Question 6: What are mean squared errors of tuned RF model in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left with default values) and on holdout set?

In [19]:
print("Mean squared error (cv): %.3f" % np.mean(np.abs(cross_val_score(locally_best_forest.best_estimator_,
                                                        X_train_scaled, y_train, 
                                                        scoring='neg_mean_squared_error'))))
print("Mean squared error (test): %.3f" % mean_squared_error(y_holdout, 
                                                             locally_best_forest.predict(X_holdout_scaled)))
Mean squared error (cv): 0.457
Mean squared error (test): 0.410

Output the Random Forest's feature importances. Again, it's handy to present them as a DataFrame.
Question 7: What is the most important feature, according to the Random Forest model?

In [20]:
rf_importance = pd.DataFrame(locally_best_forest.best_estimator_.feature_importances_, 
                             columns=['coef'], index=data.columns[:-1]) 
rf_importance.sort_values(by='coef', ascending=False)
Out[20]:
coef
alcohol 0.224432
volatile acidity 0.119393
free sulfur dioxide 0.116147
pH 0.072806
total sulfur dioxide 0.071318
residual sugar 0.070160
density 0.069367
chlorides 0.067982
fixed acidity 0.064268
citric acid 0.062945
sulphates 0.061184

Draw conclusions about the performance of the three explored models in this particular prediction task.

The dependence of wine quality on the available features is, presumably, non-linear, so Random Forest performs better in this task.