version 0.2, May 2016
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham.
import pandas as pd
import numpy as np
import zipfile

# Read the Titanic training data from the zipped CSV; PassengerId becomes the index
with zipfile.ZipFile('../datasets/titanic.csv.zip', 'r') as z:
    f = z.open('titanic.csv')
    titanic = pd.read_csv(f, sep=',', index_col=0)
titanic.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Impute missing ages with the median age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
# Fill missing embarkation ports with the most frequent value
titanic.loc[titanic.Embarked.isnull(), 'Embarked'] = titanic.Embarked.mode()[0]
# Encode sex as a binary indicator
titanic['Sex_Female'] = titanic.Sex.map({'male': 0, 'female': 1})
# One-hot encode the port of embarkation
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# Add polynomial age terms so the model can capture non-linear age effects
titanic['Age2'] = titanic['Age'] ** 2
titanic['Age3'] = titanic['Age'] ** 3
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation, removed in 0.20

# A very large C effectively disables regularization
logreg = LogisticRegression(C=1e9)
features = ['Pclass', 'Age', 'Age2', 'Age3', 'Parch', 'SibSp', 'Sex_Female', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
X = titanic[features]
y = titanic['Survived']
# Baseline: 10-fold cross-validated accuracy using all features
pd.Series(cross_val_score(logreg, X, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.716350
std       0.082241
min       0.611111
25%       0.630150
50%       0.735955
75%       0.772472
max       0.829545
dtype: float64
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by $$\mathrm{Var}[X] = p(1 - p)$$ so we can select using the threshold .8 * (1 - .8):
from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance is below that of a Bernoulli variable with p = 0.8
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit(X)
sel.variances_, sel.get_support()
(array([6.98230591e-01, 1.69322249e+02, 8.01186989e+05, 3.31010522e+09,
        6.48999031e-01, 1.21467827e+00, 2.28218083e-01, 1.53000261e-01,
        7.89513794e-02, 1.99362373e-01]),
 array([ True,  True,  True,  True,  True,  True,  True, False, False,
         True], dtype=bool))
X_sel = sel.transform(X)
features_sel = np.array(features)[sel.get_support()]
print(np.array(features)[~sel.get_support()])
['Embarked_C' 'Embarked_Q']
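As an illustrative aside (not in the original notebook): Embarked_Q is dropped because few passengers embarked at Queenstown, so the Bernoulli variance p(1 - p) of that indicator falls below the 0.16 threshold. We can check this directly:

# A boolean column's population variance equals p * (1 - p)
p = titanic['Embarked_Q'].mean()
print(p * (1 - p))                        # ~0.0790, matching sel.variances_ above
print(titanic['Embarked_Q'].var(ddof=0))  # population variance (ddof=0 to match sklearn)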
pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.701693
std       0.082188
min       0.611111
25%       0.617978
50%       0.709613
75%       0.758427
max       0.820225
dtype: float64
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

- SelectKBest removes all but the k highest scoring features.
- SelectPercentile removes all but a user-specified highest scoring percentage of features.
- SelectFpr, SelectFdr, and SelectFwe select features using common univariate statistical tests: false positive rate, false discovery rate, and family-wise error, respectively (see the sketch after this list).
- GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which allows selecting the best strategy with a hyper-parameter search estimator.
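This notebook only runs SelectKBest and SelectPercentile below, so here is a minimal illustrative sketch of one of the test-based selectors, assuming the same X, y, and features defined above. SelectFpr keeps the features whose ANOVA F-test p-value is below alpha:

from sklearn.feature_selection import SelectFpr, f_classif

# Keep features whose F-test p-value is below alpha = 0.05 (illustrative only)
sel_fpr = SelectFpr(f_classif, alpha=0.05)
sel_fpr.fit(X, y)
print(np.array(features)[sel_fpr.get_support()])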
from sklearn.feature_selection import SelectKBest

# Keep the 8 highest scoring features (the default score function is the ANOVA F-test)
sel = SelectKBest(k=8)
sel.fit(X, y)
sel.get_support()
array([ True, True, True, False, True, True, True, True, False, True], dtype=bool)
print(np.array(features)[~sel.get_support()])
['Age3' 'Embarked_Q']
print(np.array(features)[sel.get_support()])
['Pclass' 'Age' 'Age2' 'Parch' 'SibSp' 'Sex_Female' 'Embarked_C' 'Embarked_S']
X_sel = sel.transform(X)
pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.804803
std       0.026880
min       0.766667
25%       0.786517
50%       0.793258
75%       0.828652
max       0.842697
dtype: float64
There is still the question of how to select the parameter k.
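One way to answer it is to treat k as a hyper-parameter: wrap the selector and the classifier in a Pipeline and let GridSearchCV pick k by cross-validation. This is a minimal sketch, not part of the original notebook; the step names 'select' and 'clf' are arbitrary:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Search over every possible k with 10-fold cross-validation
pipe = Pipeline([('select', SelectKBest()),
                 ('clf', LogisticRegression(C=1e9))])
grid = GridSearchCV(pipe, {'select__k': list(range(1, len(features) + 1))},
                    cv=10, scoring='accuracy')
grid.fit(X, y)
grid.best_params_, grid.best_score_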
from sklearn.feature_selection import SelectPercentile, f_classif

# Keep the top 50% of features ranked by ANOVA F-score
sel = SelectPercentile(f_classif, percentile=50)
sel.fit(X, y)
sel.get_support()
array([ True, False, False, False, True, False, True, True, False, True], dtype=bool)
print(np.array(features)[~sel.get_support()])
['Age' 'Age2' 'Age3' 'SibSp' 'Embarked_Q']
print(np.array(features)[sel.get_support()])
['Pclass' 'Parch' 'Sex_Female' 'Embarked_C' 'Embarked_S']
X_sel = sel.transform(X)
pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.777797
std       0.021426
min       0.741573
25%       0.766667
50%       0.774004
75%       0.786517
max       0.820225
dtype: float64
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, the features whose absolute weights are smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
RFECV performs RFE in a cross-validation loop to find the optimal number of features (a sketch appears at the end of this section).
from sklearn.feature_selection import RFE

# Recursively drop the feature with the smallest absolute coefficient until 6 remain
sel = RFE(estimator=logreg, n_features_to_select=6)
sel.fit(X, y)
RFE(estimator=LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False),
    estimator_params=None, n_features_to_select=6, step=1, verbose=0)
sel.get_support()
array([ True, False, False, False, False, True, True, True, True, True], dtype=bool)
print(np.array(features)[~sel.get_support()])
['Age' 'Age2' 'Age3' 'Parch']
print(np.array(features)[sel.get_support()])
['Pclass' 'SibSp' 'Sex_Female' 'Embarked_C' 'Embarked_Q' 'Embarked_S']
X_sel = sel.transform(X)
pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.784526
std       0.017122
min       0.764045
25%       0.771023
50%       0.786517
75%       0.788296
max       0.820225
dtype: float64
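RFECV was mentioned above but never run. A minimal sketch, assuming the same logreg, X, y, and features from this notebook; n_features_ and support_ are the attributes RFECV exposes after fitting:

from sklearn.feature_selection import RFECV

# Run RFE inside a cross-validation loop and keep the best-scoring subset size
selcv = RFECV(estimator=logreg, step=1, cv=10, scoring='accuracy')
selcv.fit(X, y)
print(selcv.n_features_)                   # number of features RFECV keeps
print(np.array(features)[selcv.support_])  # their names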