version 0.2, May 2016
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham.
import pandas as pd
import numpy as np
import zipfile

# Read the Titanic training data from the zipped CSV; PassengerId becomes the index
with zipfile.ZipFile('../datasets/titanic.csv.zip', 'r') as z:
    f = z.open('titanic.csv')
    titanic = pd.read_csv(f, sep=',', index_col=0)
titanic.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Impute missing ages with the median age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
# Fill missing embarkation ports with the most frequent value
titanic.loc[titanic.Embarked.isnull(), 'Embarked'] = titanic.Embarked.mode()[0]
# Encode sex as a binary indicator
titanic['Sex_Female'] = titanic.Sex.map({'male': 0, 'female': 1})
# One-hot encode the port of embarkation
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# Add polynomial age terms so the model can capture non-linear age effects
titanic['Age2'] = titanic['Age'] ** 2
titanic['Age3'] = titanic['Age'] ** 3
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation, removed in 0.20

# A very large C effectively disables regularization
logreg = LogisticRegression(C=1e9)
features = ['Pclass', 'Age', 'Age2', 'Age3', 'Parch', 'SibSp', 'Sex_Female', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
X = titanic[features]
y = titanic['Survived']
# Baseline: 10-fold cross-validated accuracy using all features
pd.Series(cross_val_score(logreg, X, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.716350
std       0.082241
min       0.611111
25%       0.630150
50%       0.735955
75%       0.772472
max       0.829545
dtype: float64
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by $$\mathrm{Var}[X] = p(1 - p)$$ so we can select using the threshold .8 * (1 - .8):
from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance is below that of a Bernoulli variable with p = 0.8
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit(X)
sel.variances_, sel.get_support()
(array([6.98230591e-01, 1.69322249e+02, 8.01186989e+05, 3.31010522e+09,
        6.48999031e-01, 1.21467827e+00, 2.28218083e-01, 1.53000261e-01,
        7.89513794e-02, 1.99362373e-01]),
 array([ True,  True,  True,  True,  True,  True,  True, False, False,
         True], dtype=bool))
X_sel = sel.transform(X)
features_sel = np.array(features)[sel.get_support()]
print(np.array(features)[~sel.get_support()])
['Embarked_C' 'Embarked_Q']
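As an illustrative aside (not in the original notebook): Embarked_Q is dropped because few passengers embarked at Queenstown, so the Bernoulli variance p(1 - p) of that indicator falls below the 0.16 threshold. We can check this directly:

# A boolean column's population variance equals p * (1 - p)
p = titanic['Embarked_Q'].mean()
print(p * (1 - p))                        # ~0.0790, matching sel.variances_ above
print(titanic['Embarked_Q'].var(ddof=0))  # population variance (ddof=0 to match sklearn)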
pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.701693
std       0.082188
min       0.611111
25%       0.617978
50%       0.709613
75%       0.758427
max       0.820225
dtype: float64
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

- SelectKBest removes all but the k highest scoring features.
- SelectPercentile removes all but a user-specified highest scoring percentage of features.
- SelectFpr, SelectFdr, and SelectFwe select features using common univariate statistical tests: false positive rate, false discovery rate, and family-wise error, respectively (see the sketch after this list).
- GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which allows selecting the best strategy with a hyper-parameter search estimator.
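This notebook only runs SelectKBest and SelectPercentile below, so here is a minimal illustrative sketch of one of the test-based selectors, assuming the same X, y, and features defined above. SelectFpr keeps the features whose ANOVA F-test p-value is below alpha:

from sklearn.feature_selection import SelectFpr, f_classif

# Keep features whose F-test p-value is below alpha = 0.05 (illustrative only)
sel_fpr = SelectFpr(f_classif, alpha=0.05)
sel_fpr.fit(X, y)
print(np.array(features)[sel_fpr.get_support()])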
from sklearn.feature_selection import SelectKBest

# Keep the 8 highest scoring features (the default score function is the ANOVA F-test)
sel = SelectKBest(k=8)
sel.fit(X, y)
sel.get_support()
array([ True, True, True, False, True, True, True, True, False, True], dtype=bool)
print(np.array(features)[~sel.get_support()])
['Age3' 'Embarked_Q']
print(np.array(features)[sel.get_support()])
['Pclass' 'Age' 'Age2' 'Parch' 'SibSp' 'Sex_Female' 'Embarked_C' 'Embarked_S']
X_sel = sel.transform(X)
pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.804803
std       0.026880
min       0.766667
25%       0.786517
50%       0.793258
75%       0.828652
max       0.842697
dtype: float64
There is still the question of how to select the parameter k.
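One way to answer it is to treat k as a hyper-parameter: wrap the selector and the classifier in a Pipeline and let GridSearchCV pick k by cross-validation. This is a minimal sketch, not part of the original notebook; the step names 'select' and 'clf' are arbitrary:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Search over every possible k with 10-fold cross-validation
pipe = Pipeline([('select', SelectKBest()),
                 ('clf', LogisticRegression(C=1e9))])
grid = GridSearchCV(pipe, {'select__k': list(range(1, len(features) + 1))},
                    cv=10, scoring='accuracy')
grid.fit(X, y)
grid.best_params_, grid.best_score_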
from sklearn.feature_selection import SelectPercentile, f_classif

# Keep the top 50% of features ranked by ANOVA F-score
sel = SelectPercentile(f_classif, percentile=50)
sel.fit(X, y)
sel.get_support()
array([ True, False, False, False, True, False, True, True, False, True], dtype=bool)
print(np.array(features)[~sel.get_support()])
['Age' 'Age2' 'Age3' 'SibSp' 'Embarked_Q']
print(np.array(features)[sel.get_support()])
['Pclass' 'Parch' 'Sex_Female' 'Embarked_C' 'Embarked_S']
X_sel = sel.transform(X)
pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.777797
std       0.021426
min       0.741573
25%       0.766667
50%       0.774004
75%       0.786517
max       0.820225
dtype: float64
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, the features whose absolute weights are smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
RFECV performs RFE in a cross-validation loop to find the optimal number of features (a sketch appears at the end of this section).
from sklearn.feature_selection import RFE

# Recursively drop the feature with the smallest absolute coefficient until 6 remain
sel = RFE(estimator=logreg, n_features_to_select=6)
sel.fit(X, y)
RFE(estimator=LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False),
    estimator_params=None, n_features_to_select=6, step=1, verbose=0)
sel.get_support()
array([ True, False, False, False, False, True, True, True, True, True], dtype=bool)
print(np.array(features)[~sel.get_support()])
['Age' 'Age2' 'Age3' 'Parch']
print(np.array(features)[sel.get_support()])
['Pclass' 'SibSp' 'Sex_Female' 'Embarked_C' 'Embarked_Q' 'Embarked_S']
X_sel = sel.transform(X)
pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).describe()
count    10.000000
mean      0.784526
std       0.017122
min       0.764045
25%       0.771023
50%       0.786517
75%       0.788296
max       0.820225
dtype: float64
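RFECV was mentioned above but never run. A minimal sketch, assuming the same logreg, X, y, and features from this notebook; n_features_ and support_ are the attributes RFECV exposes after fitting:

from sklearn.feature_selection import RFECV

# Run RFE inside a cross-validation loop and keep the best-scoring subset size
selcv = RFECV(estimator=logreg, step=1, cv=10, scoring='accuracy')
selcv.fit(X, y)
print(selcv.n_features_)                   # number of features RFECV keeps
print(np.array(features)[selcv.support_])  # their names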