How to overfit your ML model?

by Li Shen, Ph.D.

Icahn School of Medicine at Mount Sinai

Updated: 2018-05-11

In [1]:
%pylab inline
import numpy as np
Populating the interactive namespace from numpy and matplotlib

Generate a purely random dataset using standard Gaussian distribution. It has 100 samples and 20,000 features. 50 samples are randomly assigned label=0 and the other 50 samples are assigned label=1.

In [3]:
x_train = np.random.randn(100, 20000)
x_train.shape
Out[3]:
(100, 20000)
In [4]:
y_train = np.concatenate([np.repeat([0], 50), np.repeat([1], 50)])
y_train.shape
Out[4]:
(100,)

Perform feature selection on the entire data set using one-way ANOVA.

In [10]:
from sklearn.feature_selection import f_classif
f_train, p_train = f_classif(x_train, y_train)
print((p_train < .01).sum())
209

209 features passed p=0.01 cutoff, which is roughly the same as 20,000 x 0.01=200.

In [12]:
feature_sel_mask = (p_train < .01)
x_train_sel = x_train[:, feature_sel_mask]
x_train_sel.shape
Out[12]:
(100, 209)

Perform 10-fold cross-validation on the train set to find the best classifier that can distinguish the two classes. I use random forest, which is considered (by some people) as a robust classifier that is less likely to overfit.

In [18]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier(n_estimators=20)

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 100
random_search = RandomizedSearchCV(clf, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search, 
                                   cv=10, 
                                   scoring='roc_auc')
random_search.fit(x_train_sel, y_train)
print('Best 10-fold CV AUC score:', random_search.best_score_)
Best 10-fold CV AUC score: 0.992

Take a closer look at the cross-validation results.

In [32]:
auc_mean_lst = [ g.mean_validation_score for g in random_search.grid_scores_]
auc_mean_lst = np.array(auc_mean_lst)
np.argmax(auc_mean_lst)
Out[32]:
28
In [36]:
best_clf_res = random_search.grid_scores_[28]
print('mean CV score:', best_clf_res.mean_validation_score)
print('all CV score:', best_clf_res.cv_validation_scores)
print('std CV score:', best_clf_res.cv_validation_scores.std())
print('Random forest params:\n', best_clf_res.parameters)
mean CV score: 0.992
all CV score: [1.   1.   1.   1.   1.   1.   1.   0.92 1.   1.  ]
std CV score: 0.023999999999999987
Random forest params:
 {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 1, 'min_samples_leaf': 8, 'min_samples_split': 3}

Can unsupvised feature selection save the day?

It is obvious that using supervised feature selection such as ANOVA can lead to extreme overfitting. Then what about using unsupervised feature selection? Can they totally avoid overfitting?

First, I'll use a method that simply selects features with large variance.

In [12]:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=1.4)
x_train_var = selector.fit_transform(x_train)
x_train_var.shape
Out[12]:
(100, 77)
In [22]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier(n_estimators=20)

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 300
random_search = RandomizedSearchCV(clf, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search, 
                                   cv=10, 
                                   scoring='roc_auc')
random_search.fit(x_train_var, y_train)
print('Best 10-fold CV AUC score:', random_search.best_score_)
Best 10-fold CV AUC score: 0.7040000000000001

It looks better. The best AUC score becomes 0.7. However, it is still higher than 0.5.

Let's try another method: the principal component analysis.

In [37]:
from sklearn.decomposition import PCA
pca = PCA(n_components=20)
x_train_pca = pca.fit_transform(x_train)
x_train_pca.shape
Out[37]:
(100, 20)
In [38]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier(n_estimators=20)

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 300
random_search = RandomizedSearchCV(clf, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search, 
                                   cv=10, 
                                   scoring='roc_auc')
random_search.fit(x_train_pca, y_train)
print('Best 10-fold CV AUC score:', random_search.best_score_)
Best 10-fold CV AUC score: 0.7020000000000001

The AUC is 0.70, similar to variance based feature selection.

Do feature selection separately for each fold?

It is wise to do the feature selection independently from test set. Do it on train set only and then apply the same procedure on the test set. Does this completely solve our problem?

In [25]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import f_classif, SelectPercentile

clf = Pipeline([
    ('feature_selection', SelectPercentile(f_classif, percentile=1)),
    ('classification', RandomForestClassifier(n_estimators=20))
])

# specify parameters and distributions to sample from
param_dist = {"classification__max_depth": [3, None],
              "classification__max_features": sp_randint(1, 11),
              "classification__min_samples_split": sp_randint(2, 11),
              "classification__min_samples_leaf": sp_randint(1, 11),
              "classification__bootstrap": [True, False],
              "classification__criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 100
random_search = RandomizedSearchCV(clf, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search, 
                                   cv=10, 
                                   scoring='roc_auc')
random_search.fit(x_train, y_train)
print('Best 10-fold CV AUC score:', random_search.best_score_)
Best 10-fold CV AUC score: 0.608

The AUC score of 0.61 is certainly much better than 0.99. This indicates the importance of doing feature selection after train/test split.

NOTE: Why did unsupervised feature selection lead to overfitting?

In case you are wondering how this happened, notice that I deliberatly generated a dataset with high dimension (p) and small sample size (N). When p>>N, even a noise may happen to have discriminative power. When cross-validation is applied to the entire dataset for model selection, such noises will be picked up to create an overfit model.

Unfortunately, this is a common case in biomedical sciences. For example, a gene expression data typically has 20,000 features and 100's of samples. For medical imaging, even a modest image size of 200x200 will give a dimension of 40,000. Because data are scarce, many researchers opt to evaluate their models using cross-validation. If that is not handled properly, overfitting happens and misleading results are reported.

How do we fix this?

Option 1: use an independent test set

In [33]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import f_classif, SelectPercentile

clf = Pipeline([
    ('feature_selection', SelectPercentile(f_classif, percentile=1)),
    ('classification', RandomForestClassifier(n_estimators=20))
])

# specify parameters and distributions to sample from
param_dist = {"classification__max_depth": [3, None],
              "classification__max_features": sp_randint(1, 11),
              "classification__min_samples_split": sp_randint(2, 11),
              "classification__min_samples_leaf": sp_randint(1, 11),
              "classification__bootstrap": [True, False],
              "classification__criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 100
rs_indep_test = RandomizedSearchCV(clf, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search, 
                                   cv=10, 
                                   scoring='roc_auc')
rs_indep_test.fit(x_train, y_train)
print('Best 10-fold CV AUC score:', rs_indep_test.best_score_)
Best 10-fold CV AUC score: 0.6680000000000001
In [36]:
x_test = np.random.randn(100, 20000)
y_test = np.concatenate([np.repeat([0], 50), np.repeat([1], 50)])
test_score = rs_indep_test.score(x_test, y_test)
print('Test AUC score:', test_score)
Test AUC score: 0.49960000000000004

On an independent test set, the AUC score reduces to less than 0.5

Option 2: use cross-validation in a nested fashion

It's not uncommon that we don't have an independent test set, especially in biomedical sciences. We want to make use of every bit of our dataset for both training and testing. To deal with this, we can use nested cross-validation.

In nested CV, we embed CVs within a CV. For each fold of the outer CV, we perform a K-fold CV on the train set and then evaluate on the test set. This way, we can make use of all the samples of our dataset for testing and make sure they are all genuine test samples.

To save time, I use 3-fold CV for both inner and outer CVs.

In [32]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import f_classif, SelectPercentile
from sklearn.model_selection import cross_val_score

clf = Pipeline([
    ('feature_selection', SelectPercentile(f_classif, percentile=1)),
    ('classification', RandomForestClassifier(n_estimators=20))
])

# specify parameters and distributions to sample from
param_dist = {"classification__max_depth": [3, None],
              "classification__max_features": sp_randint(1, 11),
              "classification__min_samples_split": sp_randint(2, 11),
              "classification__min_samples_leaf": sp_randint(1, 11),
              "classification__bootstrap": [True, False],
              "classification__criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 100
random_search = RandomizedSearchCV(clf, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search, 
                                   cv=3, 
                                   scoring='roc_auc')
nested_cv_score = cross_val_score(random_search, x_train, y_train, cv=3)
print('Nested 3-fold CV AUC score:', nested_cv_score)
Nested 3-fold CV AUC score: [0.49134948 0.5017301  0.484375  ]

All 3 folds show test AUC scores close to 0.5.