# Baseline model: bag-of-words features extracted from passenger names,
# fed into a logistic regression, scored with cross-validated accuracy.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Titanic training data; the 'Name' column is the only feature used here.
df = pd.read_csv('http://bit.ly/kaggletrain')
X = df['Name']
y = df['Survived']

# Chain vectorizer and classifier so cross-validation fits the
# vocabulary on each training fold only (no leakage into the test fold).
vect = CountVectorizer()
clf = LogisticRegression()
pipe = make_pipeline(vect, clf)
cross_val_score(pipe, X, y, scoring='accuracy').mean()
# Output: 0.7957190383528967  (baseline cross-validated accuracy)
from sklearn.feature_selection import SelectPercentile, chi2

# Insert univariate feature selection between vectorization and the
# classifier: keep only the top 50% of token features ranked by their
# chi-squared score against the target.
select_best = SelectPercentile(chi2, percentile=50)
pipe = make_pipeline(vect, select_best, clf)
cross_val_score(pipe, X, y, scoring='accuracy').mean()
# Output: 0.8147824995292197  (accuracy improves with chi-squared feature selection)
# © 2020 Data School. All rights reserved.