In the last few lectures, we learned how to use hold-out "test" sets and cross-validation to obtain realistic estimates of a model's performance on unseen data. There, the focus was on choosing a good "complexity" parameter, such as the depth of a decision tree. In this lecture, we'll instead show how to use cross-validation to estimate which columns in the data should or should not be included in a model. It's very common in practice that the best model does not use every column, and many machine learning researchers devote their careers to studying how to intelligently and automatically choose only the most relevant columns for a model. In the literature, this problem is usually called feature selection. In this lecture, we'll take a quick look at how feature selection can improve model performance.
For this demonstration, we'll switch from decision trees to logistic regression. Logistic regression is a form of regression modeling well-suited for predicting probabilities and class labels.
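To make "predicting probabilities and class labels" concrete, here is a minimal sketch of how a logistic regression model turns one row of features into a probability and then into a label. The weights, intercept, and helper names below are made up for illustration; they are not fitted to the Titanic data and are not scikit-learn's implementation.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of the positive class for one row of features."""
    # linear score: intercept plus a weighted sum of the features
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(features, weights, intercept, threshold=0.5):
    """Class label: 1 if the predicted probability reaches the threshold."""
    return int(predict_proba(features, weights, intercept) >= threshold)

# made-up weights for two hypothetical features (illustration only)
weights = [-1.2, 0.03]
intercept = 0.5
p = predict_proba([1.0, 10.0], weights, intercept)
label = predict_label([1.0, 10.0], weights, intercept)
```

Training a logistic regression amounts to choosing the weights and intercept so that these probabilities match the observed labels as well as possible.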
Let's begin by running some familiar blocks of code, in which we load our core libraries, read in the data, split the data, and clean the data.
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
# assumes that you have run the function retrieve_data()
# from "Introduction to ML in Practice" in ML_3.ipynb
titanic = pd.read_csv("data.csv")
titanic
 | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 |
883 | 1 | 1 | Miss. Margaret Edith Graham | female | 19.0 | 0 | 0 | 30.0000 |
884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.4500 |
885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.0000 |
886 | 0 | 3 | Mr. Patrick Dooley | male | 32.0 | 0 | 0 | 7.7500 |
887 rows × 8 columns
from sklearn.model_selection import train_test_split
np.random.seed(1111)
train, test = train_test_split(titanic, test_size = 0.2)
from sklearn import preprocessing
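def prep_titanic_data(data_df):
    df = data_df.copy()
    # encode Sex as an integer (LabelEncoder assigns codes in sorted order)
    le = preprocessing.LabelEncoder()
    df['Sex'] = le.fit_transform(df['Sex'])
    # passenger names are not used as a feature
    df = df.drop(['Name'], axis = 1)
    # separate the predictor columns from the target
    X = df.drop(['Survived'], axis = 1)
    y = df['Survived']
    return X, y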
X_train, y_train = prep_titanic_data(train)
X_test, y_test = prep_titanic_data(test)
Deploying logistic regression is easy, and uses exactly the same API as the decision tree classifier. Let's go ahead and use cross-validation to estimate the predictive performance of the model.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
LR = LogisticRegression()
cross_val_score(LR, X_train, y_train, cv = 5).mean()
0.7940365597842375
Is this the best we can do? If you've studied logistic regression before, you may know that using lots of columns doesn't always help: when columns are strongly correlated with one another (multicollinearity), the model's predictive performance can actually suffer. This is another aspect of overfitting. Adding more columns makes the model more flexible, and we've seen that more flexibility is not always beneficial. So, a natural question is whether we can achieve the same (or better?) model performance by using only a subset of the columns.
It's easy to train a model on a subset of the columns. For example:
cols = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
print("training with columns " + str(cols))
LR = LogisticRegression()
cross_val_score(LR, X_train[cols], y_train, cv = 5).mean()
training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
0.8025072420337629
Interesting! Excluding the last column (Fare) actually improved our CV score slightly, from roughly 0.794 to 0.803.
Now, let's write a function that will let us do this systematically. Our function will use cross-validation to avoid "peeking" at the test set.
def check_column_score(cols):
    """
    Trains and evaluates a model via cross-validation on the columns of the data
    with the specified names
    """
    print("training with columns " + str(cols))
    LR = LogisticRegression()
    return cross_val_score(LR, X_train[cols], y_train, cv = 5).mean()
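As a reminder of what `cv = 5` means conceptually, the sketch below splits row indices into folds, each of which takes one turn as the validation set while the remaining folds are used for training. This is a simplified illustration in plain Python with a hypothetical helper name, `k_fold_indices`; it is not scikit-learn's implementation, which for classifiers uses stratified folds by default.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # spread any remainder over the first few folds
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size

# ten rows, five folds: each fold holds out two rows for validation
folds = list(k_fold_indices(10, 5))
```

Because every split is drawn from the training rows only, the test set plays no part in the score that `check_column_score` returns.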
We can now check multiple combinations in a single loop. In a real problem, you might check all possible combinations, and in the Penguins data set, for example, this would be possible. In this lecture, however, we'll just compare a few.
combos = [['Sex', 'Age', 'Fare'],
          ['Pclass', 'Sex', 'Age'],
          ['Pclass', 'Parents/Children Aboard'],
          ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'],
          ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']]
for cols in combos:
    x = check_column_score(cols)
    print("CV score is " + str(np.round(x, 3)))
training with columns ['Sex', 'Age', 'Fare']
CV score is 0.773
training with columns ['Pclass', 'Sex', 'Age']
CV score is 0.795
training with columns ['Pclass', 'Parents/Children Aboard']
CV score is 0.671
training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
CV score is 0.803
training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']
CV score is 0.794
Note that the model that uses all the available columns achieves only the third-highest CV score. The model with the highest CV score uses all columns except for "Fare."
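For readers curious what checking all possible combinations might look like, here is a minimal sketch using `itertools.combinations`. The helper names `all_column_subsets` and `best_subset` are hypothetical, and `toy_score` is a made-up stand-in for a real cross-validation score so that the block runs on its own.

```python
from itertools import combinations

def all_column_subsets(columns):
    """Yield every non-empty subset of `columns` as a list."""
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            yield list(subset)

def best_subset(columns, score_fn):
    """Return (best_score, best_columns) over all non-empty subsets."""
    best_score, best_cols = float("-inf"), None
    for cols in all_column_subsets(columns):
        score = score_fn(cols)
        if score > best_score:
            best_score, best_cols = score, cols
    return best_score, best_cols

# toy scoring function (illustration only): rewards columns "a" and "b"
# and charges a small penalty for every column used
def toy_score(cols):
    return sum(1.0 for c in cols if c in ("a", "b")) - 0.1 * len(cols)

n_subsets = sum(1 for _ in all_column_subsets(["a", "b", "c"]))
score, cols = best_subset(["a", "b", "c"], toy_score)
```

In this lecture's setting, one could presumably call `best_subset(list(X_train.columns), check_column_score)`. A data set with n columns has 2^n - 1 non-empty subsets, so the six columns here give 63 models to cross-validate; the count doubles with each added column, which is why exhaustive search is only feasible for small data sets.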
Now let's see how each of these models performs on the test set.
def test_column_score(cols):
    """
    Trains and evaluates a model on the test set using the columns of the data
    with the specified names
    """
    print("testing with columns " + str(cols))
    LR = LogisticRegression()
    LR.fit(X_train[cols], y_train)
    return LR.score(X_test[cols], y_test)
for cols in combos:
    x = test_column_score(cols)
    print("test score is " + str(np.round(x, 3)))
testing with columns ['Sex', 'Age', 'Fare']
test score is 0.82
testing with columns ['Pclass', 'Sex', 'Age']
test score is 0.803
testing with columns ['Pclass', 'Parents/Children Aboard']
test score is 0.742
testing with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
test score is 0.831
testing with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']
test score is 0.82
Indeed, we achieved a higher prediction score on the test set (0.831 versus 0.82) by ignoring the "Fare" column completely.
There are a number of sophisticated algorithms for automated feature selection, but we won't go further into this topic in this course.
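To give a flavor of what an automated approach can look like, here is a minimal sketch of greedy forward selection, one of the simplest such strategies. It is not a method used in this lecture. The function name `forward_select` is hypothetical, and `toy_score` is a made-up stand-in for a cross-validation score so that the block runs on its own.

```python
def forward_select(columns, score_fn):
    """Greedy forward selection.

    Starts with no columns and repeatedly adds the single column that
    most improves score_fn, stopping when no addition improves the score.
    """
    selected = []
    remaining = list(columns)
    best_score = float("-inf")
    while remaining:
        # score each candidate column when added to the current selection
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_col = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the score
        selected.append(top_col)
        remaining.remove(top_col)
        best_score = top_score
    return selected, best_score

# toy scoring function (illustration only): rewards columns "a" and "b"
# and charges a small penalty for every column used
def toy_score(cols):
    return sum(1.0 for c in cols if c in ("a", "b")) - 0.1 * len(cols)

chosen, chosen_score = forward_select(["a", "b", "c"], toy_score)
```

With the functions from this lecture, `check_column_score` could be passed as `score_fn`. The greedy search fits far fewer models than checking every subset, at the cost of possibly missing the best combination. scikit-learn also ships related tools in `sklearn.feature_selection`, such as `SequentialFeatureSelector`.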