Feature Selection

In the last few lectures, we learned how to use hold-out "test" sets and cross-validation to obtain realistic estimates of a model's performance on unseen data. There, the focus was on choosing a good "complexity" parameter, such as the depth of a decision tree. In this lecture, we'll instead show how to use cross-validation to estimate which columns in the data should or should not be included in a model. It's very common in practice that the best model does not use all of the columns, and many machine learning researchers devote their careers to studying how to intelligently and automatically choose only the most relevant columns for models. In the literature, this problem is usually called feature selection. In this lecture, we'll take a quick look at how feature selection can improve model performance.

For this demonstration, we'll switch from decision trees to logistic regression. Logistic regression is a form of regression modeling well-suited for predicting probabilities and class labels.
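To make "probabilities and class labels" concrete, here is a minimal sketch on synthetic data. The data and the names x, y, model, and new_points are illustrative assumptions, not part of the Titanic example. It shows the two kinds of prediction a fitted scikit-learn LogisticRegression offers: predict_proba for class probabilities and predict for class labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical data: one feature; the label tends to be 1 when the feature is large.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (x[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(x, y)

new_points = np.array([[-2.0], [0.0], [2.0]])

# One row per point, one column per class: P(class 0), P(class 1).
probs = model.predict_proba(new_points)

# The predicted label is the class with the larger probability.
labels = model.predict(new_points)

print(probs)
print(labels)
```

Each row of probs sums to 1, and each predicted label matches the column with the larger probability.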

Let's begin by running some familiar blocks of code, in which we load our core libraries, read in the data, split the data, and clean the data.

In [1]:

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

In [2]:

# assumes that you have run the function retrieve_data() 
# from "Introduction to ML in Practice" in ML_3.ipynb
titanic = pd.read_csv("data.csv")
titanic

Out[2]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
0	0	3	Mr. Owen Harris Braund	male	22.0	1	0	7.2500
1	1	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	0	71.2833
2	1	3	Miss. Laina Heikkinen	female	26.0	0	0	7.9250
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	0	53.1000
4	0	3	Mr. William Henry Allen	male	35.0	0	0	8.0500
...	...	...	...	...	...	...	...	...
882	0	2	Rev. Juozas Montvila	male	27.0	0	0	13.0000
883	1	1	Miss. Margaret Edith Graham	female	19.0	0	0	30.0000
884	0	3	Miss. Catherine Helen Johnston	female	7.0	1	2	23.4500
885	1	1	Mr. Karl Howell Behr	male	26.0	0	0	30.0000
886	0	3	Mr. Patrick Dooley	male	32.0	0	0	7.7500

887 rows × 8 columns

In [3]:

from sklearn.model_selection import train_test_split

np.random.seed(1111)
train, test = train_test_split(titanic, test_size = 0.2)

In [4]:

from sklearn import preprocessing
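def prep_titanic_data(data_df):
    """Encode Sex as integers, drop Name, and split into predictors X and target y."""
    df = data_df.copy()
    # LabelEncoder assigns integer codes in sorted order: female -> 0, male -> 1
    le = preprocessing.LabelEncoder()
    df['Sex'] = le.fit_transform(df['Sex'])
    df = df.drop(['Name'], axis = 1)

    X = df.drop(['Survived'], axis = 1)
    y = df['Survived']

    return X, y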

X_train, y_train = prep_titanic_data(train)
X_test,  y_test  = prep_titanic_data(test)

Deploying logistic regression is easy, and uses exactly the same API as the decision tree classifier. Let's go ahead and use cross-validation to estimate the predictive performance of the model.

In [6]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

LR = LogisticRegression()
cross_val_score(LR, X_train, y_train, cv = 5).mean()

Out[6]:

0.7940365597842375
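To spell out what the single number above summarizes, the following sketch writes the five-fold procedure out by hand. It uses synthetic stand-in data (X_demo and y_demo are hypothetical; in the lecture these would be X_train and y_train). It relies on the documented behavior that, for a classifier and an integer cv, cross_val_score splits the data with StratifiedKFold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in data, used only so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

model = LogisticRegression()

# One accuracy score per fold; the lecture reports the mean of these.
fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# The same computation written out: fit on four folds, score on the held-out fold.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

print(fold_scores, fold_scores.mean())
print(manual_scores, manual_scores.mean())
```

The test set plays no role here: every score is computed on a held-out part of the training data.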

Is this the best we can do? If you've studied logistic regression before, you may know that using more columns doesn't always help. When columns are strongly correlated with one another (multicollinearity), the fitted coefficients can become unstable, and the model's predictive performance on unseen data can suffer. This is another aspect of overfitting: adding more columns makes the model more flexible, and we've seen that more flexibility is not always beneficial. So, a natural question is whether we can achieve the same (or better) model performance by using only a subset of the columns.

It's easy to train a model on a subset of the columns. For example:

In [7]:

cols = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
print("training with columns " + str(cols))

LR = LogisticRegression()
cross_val_score(LR, X_train[cols], y_train, cv = 5).mean()

training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']

Out[7]:

0.8025072420337629

Interesting! Excluding the last column (Fare) actually improved our CV score slightly, from about 0.794 to about 0.803.

Systematic Feature Selection

Now, let's write a function that will let us do this systematically. Our function will use cross-validation to avoid "peeking" at the test set.

In [8]:

def check_column_score(cols):
    """
    Trains and evaluates a logistic regression model via 5-fold cross-validation,
    using only the columns of X_train whose names are listed in cols.
    Returns the mean CV score.
    """
    print("training with columns " + str(cols))

    LR = LogisticRegression()
    return cross_val_score(LR, X_train[cols], y_train, cv = 5).mean()    

We can now check multiple combinations in a loop. In a real problem, you might check all possible combinations, and in the Penguins data set, for example, this would be feasible. In this lecture, however, we'll just compare a few.
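For reference, checking all possible combinations might look like the sketch below. With p columns there are 2^p - 1 non-empty subsets (63 for the six predictor columns used here), so the approach only scales to small p. The function score_all_subsets and the synthetic data are hypothetical names introduced for illustration, not part of the lecture's code.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def score_all_subsets(X, y, cv=5):
    """Map each non-empty tuple of column names to its mean CV score."""
    scores = {}
    for k in range(1, len(X.columns) + 1):
        for cols in combinations(X.columns, k):
            model = LogisticRegression()
            scores[cols] = cross_val_score(model, X[list(cols)], y, cv=cv).mean()
    return scores


# Hypothetical synthetic data: "a" and "b" determine the label, "noise" does not.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "noise"])
y_demo = (X_demo["a"] + X_demo["b"] > 0).astype(int)

subset_scores = score_all_subsets(X_demo, y_demo)
best_cols = max(subset_scores, key=subset_scores.get)
print(len(subset_scores), best_cols, round(subset_scores[best_cols], 3))
```

In the notebook, the same function could be called as score_all_subsets(X_train, y_train). The cells that follow instead compare a handful of hand-picked combinations.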

In [10]:

combos = [['Sex', 'Age', 'Fare'],
          ['Pclass', 'Sex', 'Age'],
          ['Pclass', 'Parents/Children Aboard'],
          ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'],
          ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']]

for cols in combos: 
    x = check_column_score(cols)
    print("CV score is " + str(np.round(x, 3)))

training with columns ['Sex', 'Age', 'Fare']
CV score is 0.773
training with columns ['Pclass', 'Sex', 'Age']
CV score is 0.795
training with columns ['Pclass', 'Parents/Children Aboard']
CV score is 0.671
training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
CV score is 0.803
training with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']
CV score is 0.794

Note that the model that uses all the available columns achieves only the third-highest CV score. The model with the highest CV score uses all columns except for "Fare."

Now let's see how each of these models performs on the test set.

In [11]:

def test_column_score(cols):
    """
    Trains a logistic regression model on the training set and evaluates it on
    the test set, using only the columns whose names are listed in cols.
    """
    print("testing with columns " + str(cols))
    LR = LogisticRegression()
    LR.fit(X_train[cols], y_train)
    return LR.score(X_test[cols], y_test)

In [12]:

for cols in combos: 
    x = test_column_score(cols)
    print("test score is " + str(np.round(x, 3)))

testing with columns ['Sex', 'Age', 'Fare']
test score is 0.82
testing with columns ['Pclass', 'Sex', 'Age']
test score is 0.803
testing with columns ['Pclass', 'Parents/Children Aboard']
test score is 0.742
testing with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
test score is 0.831
testing with columns ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']
test score is 0.82

Indeed, the model that ignores the "Fare" column completely achieved the highest score on the test set (0.831, compared to 0.82 for the model with all columns).

There are a number of sophisticated algorithms for automated feature selection, but we won't go further into this topic in this course.
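For the curious, here is a minimal sketch of one such automated approach: greedy forward selection with scikit-learn's SequentialFeatureSelector. It starts with no columns and repeatedly adds the column that most improves the cross-validated score. The synthetic data and column names are hypothetical, and the choice of n_features_to_select=2 is an assumption made for the demo, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: only "a" and "b" carry information about the label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 4)), columns=["a", "b", "noise1", "noise2"]
)
y_demo = (X_demo["a"] + X_demo["b"] > 0).astype(int)

# Greedy forward selection: add one column at a time, keeping the column
# that gives the best 5-fold cross-validated score at each step.
selector = SequentialFeatureSelector(
    LogisticRegression(),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
selector.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the columns of X_demo.
selected = list(X_demo.columns[selector.get_support()])
print(selected)
```

Like the manual comparison above, the selector scores candidates by cross-validation on the data it is given, so it should be fit on the training data only.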