Author: Yury Kashnitskiy (@yorko). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose. This material is a translated version of the Capstone project (by the same author) from specialization "Machine learning and data analysis" by Yandex and MIPT. No solutions shared.
Finally, we are going to train classification models, compare several algorithms via cross-validation, and figure out which session parameters (session_length and window_size) are better to use. Also, for the chosen algorithm, we will plot learning curves (which show how model performance depends on the amount of training data) and validation curves (which show how model performance depends on one of its hyperparameters).
Week 4 roadmap:
You might find the following links useful:
Your task
# pip install watermark
%load_ext watermark
%watermark -v -m -p numpy,scipy,pandas,matplotlib,statsmodels,sklearn -g
CPython 3.7.0
IPython 7.1.1

numpy 1.15.4
scipy 1.1.0
pandas 0.23.4
matplotlib 3.0.2
statsmodels 0.9.0
sklearn 0.20.0

compiler   : GCC 7.3.0
system     : Linux
release    : 4.17.14-041714-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 12
interpreter: 64bit
Git hash   : d2fa7c7dfca896055c40b5fea2f513a384ff1fda
from __future__ import division, print_function
# disable Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
from time import time
import itertools
import os
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
from matplotlib import pyplot as plt
import pickle
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
# Change the path to data
PATH_TO_DATA = '../../data/capstone_user_identification/'
Load the X_sparse_10users and y_10users objects serialized earlier; they correspond to the 10-user data.
# You might want to check the `encoding` param if you face any error while opening the .pkl files.
with open(os.path.join(PATH_TO_DATA,
'X_sparse_10users.pkl'), 'rb') as X_sparse_10users_pkl:
X_sparse_10users = pickle.load(X_sparse_10users_pkl)
with open(os.path.join(PATH_TO_DATA,
'y_10users.pkl'), 'rb') as y_10users_pkl:
y_10users = pickle.load(y_10users_pkl)
There are more than 14 thousand sessions and almost 5 thousand unique websites.
X_sparse_10users.shape
(14061, 4913)
Split the data into two parts. We are going to use the first part for cross-validation; the second part will be used to evaluate the performance of the model that we end up with after cross-validation.
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users,
test_size=0.3,
random_state=17, stratify=y_10users)
Define cross-validation: 3-fold, with shuffle, random_state=17 – for reproducibility.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
Utility function to plot validation curves after running GridSearchCV (or RandomizedSearchCV).
def plot_validation_curves(param_values, grid_cv_results_):
train_mu, train_std = grid_cv_results_['mean_train_score'], grid_cv_results_['std_train_score']
valid_mu, valid_std = grid_cv_results_['mean_test_score'], grid_cv_results_['std_test_score']
train_line = plt.plot(param_values, train_mu, '-', label='train', color='green')
valid_line = plt.plot(param_values, valid_mu, '-', label='test', color='red')
plt.fill_between(param_values, train_mu - train_std, train_mu + train_std, edgecolor='none',
facecolor=train_line[0].get_color(), alpha=0.2)
plt.fill_between(param_values, valid_mu - valid_std, valid_mu + valid_std, edgecolor='none',
facecolor=valid_line[0].get_color(), alpha=0.2)
plt.legend()
1. Train KNeighborsClassifier with 100 nearest neighbors (leave the other parameters at their default values, only set n_jobs=-1 for parallelization) and compare the model's mean accuracy on 3-fold cross-validation (for reproducibility, use the skf object) on (X_train, y_train) with the model's accuracy on (X_valid, y_valid).
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier ''' YOUR CODE IS HERE '''
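A minimal sketch of how this might look (only n_neighbors=100 and n_jobs=-1 are prescribed by the task; the variable names are illustrative):

knn = KNeighborsClassifier(n_neighbors=100, n_jobs=-1)
# mean accuracy over the 3 folds defined by skf
knn_cv_score = cross_val_score(knn, X_train, y_train, cv=skf, n_jobs=-1).mean()
# accuracy on the 30% holdout
knn.fit(X_train, y_train)
knn_valid_score = accuracy_score(y_valid, knn.predict(X_valid))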
Question 1. Evaluate KNeighborsClassifier's mean accuracy on cross-validation and its accuracy on the validation dataset. Round both answers to three decimal places and write them separated by a space.
''' YOUR CODE IS HERE '''
2. Train a random forest (RandomForestClassifier) of 100 trees (for reproducibility, set random_state=17). Compare the model's OOB score with its accuracy on (X_valid, y_valid). Use n_jobs=-1 for parallelization.
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier ''' YOUR CODE IS HERE '''
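One plausible way to fill in the cell above (oob_score=True is needed so that the fitted forest exposes the oob_score_ attribute):

forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=17, n_jobs=-1)
forest.fit(X_train, y_train)
print(forest.oob_score_)                                 # out-of-bag estimate
print(accuracy_score(y_valid, forest.predict(X_valid)))  # holdout accuracy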
Question 2. Evaluate RandomForestClassifier's Out-of-Bag (OOB) score and its accuracy on the validation dataset. Round both answers to three decimal places and write them separated by a space.
''' YOUR CODE IS HERE '''
3. Train logistic regression (LogisticRegression) with the default C value and random_state=17. Compare the model's mean accuracy on 3-fold cross-validation (don't forget to use the skf object) on (X_train, y_train) with the model's accuracy on (X_valid, y_valid). Use n_jobs=-1 for parallelization.
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
logit = LogisticRegression ''' YOUR CODE IS HERE '''
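The same comparison pattern as with KNN, sketched for logistic regression (all parameters except random_state and n_jobs are left at their defaults):

logit = LogisticRegression(random_state=17, n_jobs=-1)
logit_cv_score = cross_val_score(logit, X_train, y_train, cv=skf, n_jobs=-1).mean()
logit.fit(X_train, y_train)
logit_valid_score = accuracy_score(y_valid, logit.predict(X_valid))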
Read the documentation for LogisticRegressionCV. Logistic regression is well studied, and there are algorithms for fast search of the parameter C (faster than with GridSearchCV).

Using LogisticRegressionCV, find the optimal C value. First try a wide range: 10 values from 1e-4 up to 1e2, using logspace from NumPy. Specify multi_class='multinomial' and random_state=17 for LogisticRegressionCV. For cross-validation, use the skf object created earlier. Use n_jobs=-1 for parallelization.

Plot the validation curves for the parameter C.
%%time
logit_c_values1 = np.logspace(-4, 2, 10)
logit_grid_searcher1 = LogisticRegressionCV ''' YOUR CODE IS HERE '''
logit_grid_searcher1.fit(X_train, y_train)
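One plausible instantiation for the searcher in the cell above (the solver is left at its default, which supports the multinomial objective):

logit_grid_searcher1 = LogisticRegressionCV(Cs=logit_c_values1, multi_class='multinomial',
                                            cv=skf, random_state=17, n_jobs=-1)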
Mean accuracy during cross-validation for each of the 10 C values.
logit_mean_cv_scores1 = ''' YOUR CODE IS HERE '''
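One way to aggregate the scores (scores_ is a dict mapping each class label to an array of shape (n_folds, n_Cs), so averaging over classes and folds leaves one mean score per C value):

logit_mean_cv_scores1 = np.stack(list(logit_grid_searcher1.scores_.values())).mean(axis=(0, 1))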
Print the best accuracy on cross-validation and the corresponding value of C.
''' YOUR CODE IS HERE '''
Plot the Accuracy vs. C dependency graph on cross-validation.
plt.plot(logit_c_values1, logit_mean_cv_scores1);
Now do the same, but search for C values in the range np.linspace(0.1, 7, 20). Plot the validation curves and find the best accuracy on cross-validation.
%%time
logit_c_values2 = np.linspace(0.1, 7, 20)
logit_grid_searcher2 = LogisticRegressionCV ''' YOUR CODE IS HERE '''
logit_grid_searcher2.fit(X_train, y_train)
Mean accuracy during cross-validation for each of the 20 C values.
''' YOUR CODE IS HERE '''
Print the best accuracy on cross-validation and corresponding value of C.
''' YOUR CODE IS HERE '''
Plot the Accuracy vs. C dependency graph on cross-validation.
plt.plot(logit_c_values2, logit_mean_cv_scores2);
Print the logistic regression's accuracy with the best C value on (X_valid, y_valid).
logit_cv_acc = accuracy_score ''' YOUR CODE IS HERE '''
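A minimal sketch (after fitting, LogisticRegressionCV refits on the whole training set with the best C, so predict already uses it); the same pattern applies to svm_cv_acc for the SVM below:

logit_cv_acc = accuracy_score(y_valid, logit_grid_searcher2.predict(X_valid))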
Question 3. Evaluate the model's mean accuracy on cross-validation for logit_grid_searcher2 with the best C, and its accuracy on the validation dataset. Round both answers to three decimal places and write them separated by a space.
''' YOUR CODE IS HERE '''
4. Train an SVM (LinearSVC) with C=1 and random_state=17. Compare the model's mean accuracy on cross-validation (don't forget to use the skf object) with the model's accuracy on (X_valid, y_valid).
from sklearn.svm import LinearSVC
svm = LinearSVC ''' YOUR CODE IS HERE '''
Using GridSearchCV, find the optimal C value for the SVM. First try a wide range: 10 values from 1e-4 up to 1e4, using linspace from NumPy. Plot the validation curves.
%%time
svm_params1 = {'C': np.linspace(1e-4, 1e4, 10)}
svm_grid_searcher1 = GridSearchCV ''' YOUR CODE IS HERE '''
svm_grid_searcher1.fit(X_train, y_train)
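A plausible way to set up the search (return_train_score=True makes cv_results_ contain the mean_train_score/std_train_score fields that plot_validation_curves expects):

svm_grid_searcher1 = GridSearchCV(svm, svm_params1, cv=skf, n_jobs=-1,
                                  return_train_score=True)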
Print the best accuracy on cross-validation and the corresponding value of C.
''' YOUR CODE IS HERE '''
Plot Accuracy vs. C dependency graph on cross-validation.
plot_validation_curves(svm_params1['C'], svm_grid_searcher1.cv_results_)
But recall that with the default regularization parameter (C=1) we got a higher accuracy on cross-validation. This is a common way to make a mistake: searching for parameters in the wrong range (here we took a uniform grid on a large scale and missed the optimal interval of C values). It is more meaningful to search for C near 1; besides, the model trains faster than with larger values of C.
Using GridSearchCV, find the optimal C value for the SVM in the range (1e-3, 1), 30 values, using linspace from NumPy. Plot the validation curves.
%%time
svm_params2 = {'C': np.linspace(1e-3, 1, 30)}
svm_grid_searcher2 = GridSearchCV ''' YOUR CODE IS HERE '''
svm_grid_searcher2.fit(X_train, y_train)
Print the best accuracy on cross-validation and the corresponding value of C.
''' YOUR CODE IS HERE '''
Plot Accuracy vs. C dependency graph on cross-validation.
plot_validation_curves(svm_params2['C'], svm_grid_searcher2.cv_results_)
Print LinearSVC's accuracy with the best C value on (X_valid, y_valid).
svm_cv_acc = accuracy_score ''' YOUR CODE IS HERE '''
Question 4. Evaluate the model's mean accuracy on cross-validation for svm_grid_searcher2 with the best C, and its accuracy on the validation dataset. Round both answers to three decimal places and write them separated by a space.
''' YOUR CODE IS HERE '''
Let's take LinearSVC, since it performed best on cross-validation in part 1, and check its performance on the 9 datasets of 10 users (with different combinations of session_length and window_size). Since there are many more computations here, we will not search for the regularization parameter C each time.
Write the model_assessment function with the specification provided below. Pay attention to all the details; e.g., train_test_split should be stratified. Don't forget random_state anywhere.
def model_assessment(estimator, path_to_X_pickle, path_to_y_pickle, cv, random_state=17, test_size=0.3):
'''
Estimates CV-accuracy for (1 - test_size) share of (X_sparse, y)
loaded from path_to_X_pickle and path_to_y_pickle and holdout accuracy for (test_size) share of (X_sparse, y).
The split is made with stratified train_test_split with params random_state and test_size.
:param estimator – Scikit-learn estimator (classifier or regressor)
:param path_to_X_pickle – path to pickled sparse X (instances and their features)
:param path_to_y_pickle – path to pickled y (responses)
:param cv – cross-validation as in cross_val_score (use StratifiedKFold here)
:param random_state – for train_test_split
:param test_size – for train_test_split
:returns mean CV-accuracy for (X_train, y_train) and accuracy for (X_valid, y_valid) where (X_train, y_train) and (X_valid, y_valid) are (1 - test_size) and (test_size) shares of (X_sparse, y).
'''
''' YOUR CODE IS HERE '''
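A sketch that follows the docstring, with accuracy as the score (adding execution time as a third return value, as suggested below, is a straightforward extension):

def model_assessment(estimator, path_to_X_pickle, path_to_y_pickle, cv,
                     random_state=17, test_size=0.3):
    with open(path_to_X_pickle, 'rb') as X_pkl:
        X_sparse = pickle.load(X_pkl)
    with open(path_to_y_pickle, 'rb') as y_pkl:
        y = pickle.load(y_pkl)
    # stratified split, as the spec requires
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_sparse, y, test_size=test_size, random_state=random_state, stratify=y)
    cv_score = cross_val_score(estimator, X_train, y_train, cv=cv, n_jobs=-1).mean()
    estimator.fit(X_train, y_train)
    valid_score = accuracy_score(y_valid, estimator.predict(X_valid))
    return cv_score, valid_score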
Double-check that the function is working.
model_assessment(svm_grid_searcher2.best_estimator_,
os.path.join(PATH_TO_DATA, 'X_sparse_10users.pkl'),
os.path.join(PATH_TO_DATA, 'y_10users.pkl'), skf, random_state=17, test_size=0.3)
Apply the model_assessment function to the best algorithm from the previous part (namely, svm_grid_searcher2.best_estimator_) and the 9 datasets of 10 users with different combinations of session_length and window_size. In the loop, print the session_length and window_size parameters as well as the output of the model_assessment function.

It's handy if the model_assessment function returns execution time as a third output argument. It took about 20 seconds to execute this code snippet on my laptop, but with the 150-user dataset each iteration takes a couple of minutes.
Here, for convenience, it's worth creating copies of the pickle files X_sparse_10users.pkl, X_sparse_150users.pkl, y_10users.pkl, and y_150users.pkl, appending s10_w10 to their names, which stands for a session length of 10 and a window width of 10.
# Won't work on non-Linux machines (it just creates copies of the files)
!cp $PATH_TO_DATA/X_sparse_10users.pkl $PATH_TO_DATA/X_sparse_10users_s10_w10.pkl
!cp $PATH_TO_DATA/X_sparse_150users.pkl $PATH_TO_DATA/X_sparse_150users_s10_w10.pkl
!cp $PATH_TO_DATA/y_10users.pkl $PATH_TO_DATA/y_10users_s10_w10.pkl
!cp $PATH_TO_DATA/y_150users.pkl $PATH_TO_DATA/y_150users_s10_w10.pkl
%%time
estimator = svm_grid_searcher2.best_estimator_
for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
if window_size <= session_length:
path_to_X_pkl = ''' YOUR CODE IS HERE '''
path_to_y_pkl = ''' YOUR CODE IS HERE '''
print ''' YOUR CODE IS HERE '''
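A sketch of the loop, assuming the other eight datasets were serialized earlier in the project with the same s{session_length}_w{window_size} naming pattern as the copies created above:

for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
    if window_size <= session_length:
        path_to_X_pkl = os.path.join(PATH_TO_DATA,
            'X_sparse_10users_s{}_w{}.pkl'.format(session_length, window_size))
        path_to_y_pkl = os.path.join(PATH_TO_DATA,
            'y_10users_s{}_w{}.pkl'.format(session_length, window_size))
        print(window_size, session_length,
              model_assessment(estimator, path_to_X_pkl, path_to_y_pkl, skf))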
Question 5. Evaluate LinearSVC's accuracy with the optimal C on the X_sparse_10users_s15_w5 dataset. Write the model's mean accuracy on cross-validation and its accuracy on the validation dataset. Round both answers to three decimal places and write them separated by a space.
''' YOUR CODE IS HERE '''
Comment on the results. Compare the mean accuracy on cross-validation and on the validation dataset for the following combinations of the (session_length, window_size) parameters: (5,5), (7,7), and (10,10). On an average laptop this could take up to an hour; after all, it's data science :).

Draw a conclusion about how accuracy depends on session length and window width.
%%time
estimator = svm_grid_searcher2.best_estimator_
for window_size, session_length in [(5,5), (7,7), (10,10)]:
path_to_X_pkl = ''' YOUR CODE IS HERE '''
path_to_y_pkl = ''' YOUR CODE IS HERE '''
print ''' YOUR CODE IS HERE '''
Question 6. Evaluate LinearSVC's accuracy with the optimal C value on X_sparse_150users. Write the model's mean accuracy on cross-validation and its accuracy on the validation dataset. Round both answers to three decimal places and write them separated by a space.
''' YOUR CODE IS HERE '''
While it may be disappointing that accuracy in this multiclass classification problem with 150 users is low, let's take comfort in the fact that particular users can still be identified quite well.
Load the X_sparse_150users and y_150users objects serialized earlier; they correspond to the 150-user dataset with parameters (session_length, window_size) = (10,10). Split them into two parts: 70% training data and 30% validation data.
with open(os.path.join(PATH_TO_DATA, 'X_sparse_150users.pkl'), 'rb') as X_sparse_150users_pkl:
X_sparse_150users = pickle.load(X_sparse_150users_pkl)
with open(os.path.join(PATH_TO_DATA, 'y_150users.pkl'), 'rb') as y_150users_pkl:
y_150users = pickle.load(y_150users_pkl)
X_train_150, X_valid_150, y_train_150, y_valid_150 = train_test_split(X_sparse_150users,
y_150users, test_size=0.3,
random_state=17, stratify=y_150users)
Train LogisticRegressionCV with a single C value (take the best C value found on cross-validation in part 1; use the exact value, not an approximate one). Now we are going to solve 150 One-vs-All tasks, hence set multi_class='ovr'. As usual, set n_jobs=-1 and random_state=17 wherever possible (this training might take up to 20 minutes).
%%time
logit_cv_150users = LogisticRegressionCV ''' YOUR CODE IS HERE '''
logit_cv_150users.fit(X_train_150, y_train_150)
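One plausible instantiation (best_C stands for the exact optimum found in part 1; after fitting the multinomial logit_grid_searcher2, its per-class C_ array holds that same value in every entry, so C_[0] recovers it):

best_C = logit_grid_searcher2.C_[0]
logit_cv_150users = LogisticRegressionCV(Cs=[best_C], multi_class='ovr',
                                         cv=skf, random_state=17, n_jobs=-1)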
Compare the mean accuracy on cross-validation for each user-identification problem separately.
cv_scores_by_user = {}
for user_id in logit_cv_150users.scores_:
print('User {}, CV score: {}'.format ''' YOUR CODE IS HERE '''
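A sketch of the loop (for multi_class='ovr', scores_[user_id] has shape (n_folds, 1), one column for the single C value, so mean() gives that user's mean CV accuracy):

cv_scores_by_user = {}
for user_id in logit_cv_150users.scores_:
    cv_scores_by_user[user_id] = logit_cv_150users.scores_[user_id].mean()
    print('User {}, CV score: {:.4f}'.format(user_id, cv_scores_by_user[user_id]))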
The accuracy may seem impressive but, perhaps, we are forgetting about class imbalance: high accuracy could be obtained simply with a constant prediction. For each user in y_train_150, evaluate the difference between the accuracy on cross-validation (which we've just computed with LogisticRegressionCV) and the fraction of labels that differ from user_id (that's the accuracy we would get if the classifier always answered that it is not the $i$-th user in classification task $i$-vs-All).
class_distr = np.bincount(y_train_150.astype('int'))
for user_id in np.unique(y_train_150):
''' YOUR CODE IS HERE '''
num_better_than_default = (np.array(list(acc_diff_vs_constant.values())) > 0).sum()
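A sketch of the whole comparison, assuming cv_scores_by_user from the previous cell and that user labels are non-negative integers indexing class_distr:

acc_diff_vs_constant = {}
for user_id in np.unique(y_train_150):
    # a constant "not this user" answer is correct for all other users' sessions
    constant_acc = 1. - class_distr[int(user_id)] / class_distr.sum()
    acc_diff_vs_constant[user_id] = cv_scores_by_user[user_id] - constant_acc
num_better_than_default = (np.array(list(acc_diff_vs_constant.values())) > 0).sum()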
Question 7. Evaluate the fraction of users for whom LogisticRegressionCV performs better than a constant prediction. Round your answer to three decimal places.
''' YOUR CODE IS HERE '''
The next step is to plot learning curves for a particular user, say the 128th. Make a new binary vector from y_150users; its values are 1 or 0 depending on whether user_id equals 128 or not.
y_binary_128 = ''' YOUR CODE IS HERE '''
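A one-line sketch:

y_binary_128 = (y_150users == 128).astype('int')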
from sklearn.model_selection import learning_curve
def plot_learning_curve(val_train, val_test, train_sizes,
xlabel='Training Set Size', ylabel='score'):
def plot_with_err(x, data, **kwargs):
mu, std = data.mean(1), data.std(1)
lines = plt.plot(x, mu, '-', **kwargs)
plt.fill_between(x, mu - std, mu + std, edgecolor='none',
facecolor=lines[0].get_color(), alpha=0.2)
plot_with_err(train_sizes, val_train, label='train')
plot_with_err(train_sizes, val_test, label='valid')
plt.xlabel(xlabel); plt.ylabel(ylabel)
plt.legend(loc='lower right');
Evaluate the accuracy on cross-validation for the "user 128 vs. All" task depending on the training set size. It would be useful to check the documentation for learning_curve.
%%time
train_sizes = np.linspace(0.25, 1, 20)
estimator = svm_grid_searcher2.best_estimator_
n_train, val_train, val_test = learning_curve ''' YOUR CODE IS HERE '''
plot_learning_curve(val_train, val_test, n_train,
xlabel='train_size', ylabel='accuracy')
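A plausible call for the cell above (cv=skf already carries shuffling and random_state=17, and accuracy is the default scorer for a classifier):

n_train, val_train, val_test = learning_curve(estimator, X_sparse_150users, y_binary_128,
                                              train_sizes=train_sizes, cv=skf, n_jobs=-1)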
Draw a conclusion about whether new data helps to improve the model's accuracy for the same problem definition.
Next week, we will recall linear models trained with stochastic gradient descent, and enjoy how much faster they work.