Course: Building an Effective ML Workflow with scikit-learn

Last week:

  • Review of the basic Machine Learning workflow
  • Encoding categorical data
  • Using ColumnTransformer and Pipeline
  • Recap
  • Encoding text data

This week:

  • Handling missing values
  • Switching to the full dataset
  • Recap
  • Evaluating and tuning a Pipeline

Starter code (copy from here: http://bit.ly/first-ml-lesson)

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
In [2]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
In [3]:
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']
In [4]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]
In [5]:
ohe = OneHotEncoder()
vect = CountVectorizer()
In [6]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
In [7]:
logreg = LogisticRegression(solver='liblinear', random_state=1)
In [8]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
Out[8]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

Part 5: Handling missing values

We want to use "Age" as a feature, but note that it has a missing value (encoded as "NaN"):

In [9]:
df
Out[9]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Try to add the "Age" column to our model:

  • Fitting the pipeline will throw an error due to the presence of a missing value
  • scikit-learn models don't accept data with missing values (except for Histogram-based Gradient Boosting Trees)
In [10]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
X = df[cols]
X
Out[10]:
Parch Fare Embarked Sex Name Age
0 0 7.2500 S male Braund, Mr. Owen Harris 22.0
1 0 71.2833 C female Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0
2 0 7.9250 S female Heikkinen, Miss. Laina 26.0
3 0 53.1000 S female Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
4 0 8.0500 S male Allen, Mr. William Henry 35.0
5 0 8.4583 Q male Moran, Mr. James NaN
6 0 51.8625 S male McCarthy, Mr. Timothy J 54.0
7 1 21.0750 S male Palsson, Master. Gosta Leonard 2.0
8 2 11.1333 S female Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27.0
9 0 30.0708 C female Nasser, Mrs. Nicholas (Adele Achem) 14.0
In [11]:
# pipe.fit(X, y)

One option is to drop any rows from the DataFrame that have missing values:

  • This can be a useful approach, but only if you know that the missingness is random and it only affects a small portion of your dataset
  • If a lot of your rows have missing values, then this approach will throw away too much useful training data
In [12]:
X.dropna()
Out[12]:
Parch Fare Embarked Sex Name Age
0 0 7.2500 S male Braund, Mr. Owen Harris 22.0
1 0 71.2833 C female Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0
2 0 7.9250 S female Heikkinen, Miss. Laina 26.0
3 0 53.1000 S female Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
4 0 8.0500 S male Allen, Mr. William Henry 35.0
6 0 51.8625 S male McCarthy, Mr. Timothy J 54.0
7 1 21.0750 S male Palsson, Master. Gosta Leonard 2.0
8 2 11.1333 S female Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27.0
9 0 30.0708 C female Nasser, Mrs. Nicholas (Adele Achem) 14.0

A second option is to drop any features that have missing values:

  • However, you may be throwing away a useful feature
In [13]:
X.dropna(axis='columns')
Out[13]:
Parch Fare Embarked Sex Name
0 0 7.2500 S male Braund, Mr. Owen Harris
1 0 71.2833 C female Cumings, Mrs. John Bradley (Florence Briggs Th...
2 0 7.9250 S female Heikkinen, Miss. Laina
3 0 53.1000 S female Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 0 8.0500 S male Allen, Mr. William Henry
5 0 8.4583 Q male Moran, Mr. James
6 0 51.8625 S male McCarthy, Mr. Timothy J
7 1 21.0750 S male Palsson, Master. Gosta Leonard
8 2 11.1333 S female Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 0 30.0708 C female Nasser, Mrs. Nicholas (Adele Achem)

A third option is to impute missing values:

  • Imputation means that you are filling in missing values based on what you know from the non-missing data
  • Carefully consider the costs and benefits of imputation before proceeding, because you are making up data

Use SimpleImputer to perform the imputation:

  • It requires 2-dimensional input (just like OneHotEncoder)
  • By default, it fills missing values with the mean of the non-missing values
  • It also supports other imputation strategies: median value, most frequent value, or a user-defined value
In [14]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer()
imp.fit_transform(X[['Age']])
Out[14]:
array([[22.        ],
       [38.        ],
       [26.        ],
       [35.        ],
       [35.        ],
       [28.11111111],
       [54.        ],
       [ 2.        ],
       [27.        ],
       [14.        ]])

Examine the statistics_ attribute (which was learned during the fit step) to see what value was imputed:

In [15]:
imp.statistics_
Out[15]:
array([28.11111111])

Update the ColumnTransformer to include the SimpleImputer:

  • Brackets are required around "Age" because SimpleImputer expects 2-dimensional input
  • Reminder: Brackets are not allowed around "Name" because CountVectorizer expects 1-dimensional input
In [16]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough')
In [17]:
ct.fit_transform(X)
Out[17]:
<10x48 sparse matrix of type '<class 'numpy.float64'>'
	with 88 stored elements in Compressed Sparse Row format>

Update the Pipeline to include the revised ColumnTransformer, and fit it on X and y:

In [18]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);

Examine the "named_steps" to confirm that the Pipeline looks correct:

In [19]:
pipe.named_steps
Out[19]:
{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                   transformer_weights=None,
                   transformers=[('onehotencoder',
                                  OneHotEncoder(categories='auto', drop=None,
                                                dtype=<class 'numpy.float64'>,
                                                handle_unknown='error',
                                                sparse=True),
                                  ['Embarked', 'Sex']),
                                 ('countvectorizer',
                                  CountVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=...
                                                  input='content',
                                                  lowercase=True, max_df=1.0,
                                                  max_features=None, min_df=1,
                                                  ngram_range=(1, 1),
                                                  preprocessor=None,
                                                  stop_words=None,
                                                  strip_accents=None,
                                                  token_pattern='(?u)\\b\\w\\w+\\b',
                                                  tokenizer=None,
                                                  vocabulary=None),
                                  'Name'),
                                 ('simpleimputer',
                                  SimpleImputer(add_indicator=False, copy=True,
                                                fill_value=None,
                                                missing_values=nan,
                                                strategy='mean', verbose=0),
                                  ['Age'])],
                   verbose=False),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                    warm_start=False)}

Update X_new to use the same columns as X, and then make predictions:

In [20]:
X_new = df_new[cols]
pipe.predict(X_new)
Out[20]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

What happened during the predict step?

  • If X_new didn't have any missing values in "Age", then nothing gets imputed during prediction
  • If X_new did have missing values in "Age", then the imputation value is the mean of "Age" in X (which was 28.11), not the mean of "Age" in X_new
    • This is important because you are only allowed to learn from the training data, and then apply what you learned to both the training and testing data
    • This is why we fit_transform on training data, and transform (only) on testing data
  • During prediction, every row (in X_new) is considered independently and predictions are done one at a time
    • Thus if you passed a single row to the predict method, it becomes obvious that scikit-learn has to look to the training data for the imputation value (see the sketch below)
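
To make this concrete, here's a minimal sketch (not part of the original notebook; "imp_demo" and "single_row" are names invented for illustration) showing that a fitted SimpleImputer fills a missing "Age" in new data with the mean it learned from X:

import numpy as np

imp_demo = SimpleImputer()
imp_demo.fit(X[['Age']])                       # learns the training mean (about 28.11)

single_row = pd.DataFrame({'Age': [np.nan]})   # a hypothetical new passenger with a missing "Age"
imp_demo.transform(single_row)                 # filled with the training mean, not anything from this row
# expected output: array([[28.11111111]])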

When imputing missing values, you can also add "missingness" as a feature:

  • Set "add_indicator=True" (new in version 0.21) to add a binary indicator matrix indicating the presence of missing values
  • This is useful when the data is not missing at random, since there might be a relationship between "missingness" and the target
    • Example: If "Age" is missing because older passengers declined to give their ages, and older passengers are more likely to have survived, then there is a relationship between "missing Age" and "Survived"
In [21]:
imp_indicator = SimpleImputer(add_indicator=True)
imp_indicator.fit_transform(X[['Age']])
Out[21]:
array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

There are also other imputers available in scikit-learn:

  • IterativeImputer (new in version 0.21)
  • KNNImputer (new in version 0.22)

These new imputers will produce more useful imputations than SimpleImputer in some (but not all) cases.
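
As a rough illustration (not from the original notebook; "knn_imp" is a name invented here), KNNImputer fills a missing value using that column's values in the most similar rows, where similarity is computed from the other non-missing features. Note that IterativeImputer still requires an explicit experimental import:

from sklearn.impute import KNNImputer

knn_imp = KNNImputer(n_neighbors=2)            # impute "Age" from the 2 nearest rows (nearness based on "Fare" here)
knn_imp.fit_transform(X[['Fare', 'Age']])

# IterativeImputer is experimental, so it needs an extra import to enable it:
# from sklearn.experimental import enable_iterative_imputer
# from sklearn.impute import IterativeImputer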

Part 6: Switching to the full dataset

Read the full datasets into df and df_new:

In [22]:
df = pd.read_csv('http://bit.ly/kaggletrain')
df.shape
Out[22]:
(891, 12)
In [23]:
df_new = pd.read_csv('http://bit.ly/kaggletest')
df_new.shape
Out[23]:
(418, 11)

Check for missing values in the full datasets:

  • There are two new problems we'll have to handle that weren't present in our smaller datasets:
    • Problem 1: "Embarked" has missing values in df
    • Problem 2: "Fare" has missing values in df_new
In [24]:
df.isna().sum()
Out[24]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
In [25]:
df_new.isna().sum()
Out[25]:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Redefine X and y for the full dataset:

In [26]:
X = df[cols]
y = df['Survived']

fit_transform will error since "Embarked" contains missing values (problem 1):

In [27]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough')
In [28]:
# ct.fit_transform(X)

We'll solve problem 1 by imputing missing values for "Embarked" before one-hot encoding it.

First create a new imputer:

  • For categorical features, you can impute the most frequent value or a user-defined value
  • We'll impute a user-defined value of "missing" (a string):
    • This essentially treats missing values as a fourth category, and it will become a fourth column during one-hot encoding
    • This is similar (but not identical) to imputing the most frequent value and then adding a missing indicator
In [29]:
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')

Next create a Pipeline of two transformers:

  • Step 1 is imputation, and step 2 is one-hot encoding
  • fit_transform on "Embarked" now outputs four columns (rather than three)
In [30]:
imp_ohe = make_pipeline(imp_constant, ohe)
In [31]:
imp_ohe.fit_transform(X[['Embarked']])
Out[31]:
<891x4 sparse matrix of type '<class 'numpy.float64'>'
	with 891 stored elements in Compressed Sparse Row format>

This is what happens "under the hood" when you fit_transform the Pipeline:

In [32]:
ohe.fit_transform(imp_constant.fit_transform(X[['Embarked']]))
Out[32]:
<891x4 sparse matrix of type '<class 'numpy.float64'>'
	with 891 stored elements in Compressed Sparse Row format>

Here are the rules for Pipelines:

  • All Pipeline steps other than the final step must be transformers; the final step can be a model or a transformer
  • Our larger Pipeline (called "pipe") ends in a model, and thus we use the fit and predict methods with it
  • Our smaller Pipeline (called "imp_ohe") ends in a transformer, and thus we use the fit_transform and transform methods with it (see the sketch below)
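
Here's a purely illustrative sketch of the two usage patterns side-by-side (not meant to be run at this exact point in the notebook, since "pipe" hasn't yet been rebuilt with the revised ColumnTransformer):

imp_ohe.fit_transform(X[['Embarked']])     # transformer-ending Pipeline: fit_transform on training data...
imp_ohe.transform(X_new[['Embarked']])     # ...then transform (only) on new data

pipe.fit(X, y)                             # model-ending Pipeline: fit on training data...
pipe.predict(X_new)                        # ...then predict on new data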

Replace "ohe" with "imp_ohe" in the ColumnTransformer:

  • You can use any transformer inside of a ColumnTransformer, and "imp_ohe" is eligible since it acts like a transformer
  • It's fine to apply "imp_ohe" to "Sex" as well as "Embarked":
    • There are no missing values in "Sex" so the imputation step won't affect it
In [33]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough')

We have solved problem 1, so we can now fit_transform on X:

  • The feature matrix is much wider than before because "Name" contains a large number of unique words
In [34]:
ct.fit_transform(X)
Out[34]:
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

We'll solve problem 2 by imputing missing values for "Fare":

  • Modify the ColumnTransformer to apply the "imp" transformer to "Fare"
  • Remember that "Fare" only has missing values in X_new, but not in X:
    • When the imputer is fit to X, it will learn the imputation value that will be applied to X_new during prediction
In [35]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

fit_transform outputs the same number of columns as before, since "Fare" just moved from a passthrough column to a transformed column:

In [36]:
ct.fit_transform(X)
Out[36]:
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

Update the Pipeline to include the revised ColumnTransformer, and fit it on X and y:

In [37]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);

Update X_new to use the same columns as X, and then make predictions:

In [38]:
X_new = df_new[cols]
pipe.predict(X_new)
Out[38]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

Recap

This is all of the code that is necessary to recreate our workflow up to this point:

  • You can copy/paste this code from http://bit.ly/complex-pipeline
  • There are no calls to "fit_transform" or "transform" because all of that functionality is encapsulated by the Pipeline
In [39]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
In [40]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
In [41]:
df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']
In [42]:
df_new = pd.read_csv('http://bit.ly/kaggletest')
X_new = df_new[cols]
In [43]:
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()
In [44]:
imp_ohe = make_pipeline(imp_constant, ohe)
vect = CountVectorizer()
imp = SimpleImputer()
In [45]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')
In [46]:
logreg = LogisticRegression(solver='liblinear', random_state=1)
In [47]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
Out[47]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

Comparing Pipeline and ColumnTransformer:

  • ColumnTransformer pulls out subsets of columns and transforms them independently, and then stacks the results side-by-side
  • Pipeline is a series of steps that occur in order, and the output of each step passes to the next step (see the sketch below)
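
A purely illustrative recap of that distinction, using the objects defined above (the comments are the important part):

# ColumnTransformer: each transformer sees only its own subset of columns, and the
# transformed subsets are stacked side-by-side into one feature matrix
make_column_transformer((imp_ohe, ['Embarked', 'Sex']), (vect, 'Name'), (imp, ['Age', 'Fare']))

# Pipeline: the steps run in sequence, so the output of "imp_constant" becomes the
# input to "ohe" (just like the output of "ct" becomes the input to "logreg" in "pipe")
make_pipeline(imp_constant, ohe)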

Why wouldn't we do all of the transformations in pandas, and just use scikit-learn for model building?

  1. CountVectorizer is highly useful for encoding text data, and that type of encoding can't be done using pandas
    • Using both pandas and scikit-learn for transformations adds workflow complexity, especially if you have to combine a dense matrix (output by pandas) and a sparse matrix (output by CountVectorizer)
  2. One-hot encoding can be done using pandas, but you would probably end up adding the encoded columns to your DataFrame
    • This makes the DataFrame larger and more difficult to navigate
  3. Missing value imputation can be done using pandas, but it will result in data leakage

What is data leakage?

  • Inadvertently including knowledge from the testing data when training a model

Why is data leakage bad?

  • Your model evaluation scores will be less reliable
    • This may lead you to make bad decisions when tuning hyperparameters
    • This will lead you to overestimate how well your model will perform on new data
  • It's hard to know whether your scores will be off by a negligible amount or a huge amount

Why would missing value imputation in pandas cause data leakage?

  • Your model evaluation procedure (such as cross-validation) is supposed to simulate the future, so that you can accurately estimate right now how well your model will perform on new data
  • If you impute missing values on your whole dataset in pandas and then pass your dataset to scikit-learn, your model evaluation procedure will no longer be an accurate simulation of reality
    • This is because the imputation values are based on your entire dataset, rather than just the training portion of your dataset
    • Keep in mind that the "training portion" will change 5 times during 5-fold cross-validation, thus it's quite impractical to avoid data leakage if you use pandas for imputation (see the sketch below)
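
Here's a minimal sketch of the contrast (illustrative only; "X_leaky" is a name invented here, and cross_val_score is introduced properly in Part 7):

from sklearn.model_selection import cross_val_score

# leaky: the fill value is computed from ALL rows, including rows that will later
# act as the "testing" fold during cross-validation
X_leaky = X.copy()
X_leaky['Age'] = X_leaky['Age'].fillna(X_leaky['Age'].mean())

# leakage-free: the imputer lives inside the Pipeline, so it is refit on the
# training portion only during each of the 5 splits
cross_val_score(pipe, X, y, cv=5, scoring='accuracy')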

What other transformations in pandas will cause data leakage?

  • Feature scaling
  • One-hot encoding (unless there is a fixed set of categories)
  • Any transformations which incorporate information about other rows when transforming a row

How does scikit-learn prevent data leakage?

  • It has separate fit and transform steps, which allow you to base your data transformations on the training set only, and then apply those transformations to both the training set and the testing set (see the scaling sketch below)
  • Pipeline's fit and predict methods ensure that fit_transform and transform are called at the appropriate times
  • cross_val_score and GridSearchCV split the data prior to performing data transformations
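
For example (a hedged sketch, not part of the original workflow; "scaled_pipe" is a name invented here), if you wanted to add feature scaling, you would place the scaler inside the Pipeline rather than scaling the DataFrame in pandas first, so that it only learns from the training portion:

from sklearn.preprocessing import StandardScaler

# with_mean=False keeps the sparse feature matrix (output by "ct") sparse
scaled_pipe = make_pipeline(ct, StandardScaler(with_mean=False), logreg)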

Part 7: Evaluating and tuning a Pipeline

We can use cross_val_score on the entire Pipeline to estimate its classification accuracy:

  • Cross-validation is a useful tool now that we're using the full dataset
  • We're using 5 folds because it has been shown to be a reasonable default choice
  • cross_val_score performs the data transformations (specified in the ColumnTransformer) after each of the 5 data splits in order to prevent data leakage
    • If it performed the data transformations before the data splits, that would have resulted in data leakage
In [48]:
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
Out[48]:
0.8114619295712762

Our next step is to tune the hyperparameters for both the model and the transformers:

  • We have been using the default hyperparameters for most objects
  • "Hyperparameters" are values you set, whereas "parameters" are values learned by the estimator during the fitting process
  • Hyperparameter tuning is likely to result in a more accurate model

We'll use GridSearchCV for hyperparameter tuning:

  • You define what values you want to try for each hyperparameter, and it cross-validates every possible combination of those values
  • You should tune hyperparameters together rather than one at a time, since they can interact and the best performing combination might not include any of the default values
  • Being able to tune the transformers at the same time as the model is yet another benefit of doing transformations in scikit-learn rather than pandas

Because we're tuning a Pipeline, we need to know the step names from named_steps:

In [49]:
pipe.named_steps.keys()
Out[49]:
dict_keys(['columntransformer', 'logisticregression'])

Specify the hyperparameters and values to try in a dictionary:

  • Create an empty dictionary called params
  • For our logistic regression model, we will tune:
    • penalty: type of regularization (default is 'l2')
    • C: amount of regularization (default is 1.0)
    • Choosing which hyperparameters to tune and what values to try requires both research and experience
  • The dictionary key is the step name, followed by 2 underscores, followed by the hyperparameter name
  • The dictionary value is the list of values you want to try for that hyperparameter
In [50]:
params = {}
params['logisticregression__penalty'] = ['l1', 'l2']
params['logisticregression__C'] = [0.1, 1, 10]
params
Out[50]:
{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10]}

Set up the grid search:

  • Creating a GridSearchCV instance is similar to cross_val_score, except that you don't pass X and y but you do pass params
  • Fitting the GridSearchCV object performs the grid search
In [51]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y);

Convert the results of the grid search into a DataFrame:

  • 6 rows means that it ran cross-validation 6 times, which is every possible combination of C (3 values) and penalty (2 values)
In [52]:
results = pd.DataFrame(grid.cv_results_)
results
Out[52]:
mean_fit_time std_fit_time mean_score_time std_score_time param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.013179 0.001090 0.006131 0.001403 0.1 l1 {'logisticregression__C': 0.1, 'logisticregres... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 6
1 0.012467 0.000274 0.004867 0.000117 0.1 l2 {'logisticregression__C': 0.1, 'logisticregres... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 5
2 0.013442 0.000392 0.004720 0.000045 1 l1 {'logisticregression__C': 1, 'logisticregressi... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 2
3 0.012881 0.000346 0.004768 0.000058 1 l2 {'logisticregression__C': 1, 'logisticregressi... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 3
4 0.018128 0.002229 0.004792 0.000173 10 l1 {'logisticregression__C': 10, 'logisticregress... 0.821229 0.814607 0.814607 0.792135 0.848315 0.818178 0.018007 1
5 0.013615 0.000414 0.004737 0.000087 10 l2 {'logisticregression__C': 10, 'logisticregress... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 4

Sort the DataFrame by "rank_test_score":

  • Our column of interest is "mean_test_score"
  • Best result was C=10 and penalty='l1', neither of which was the default
In [53]:
results.sort_values('rank_test_score')
Out[53]:
mean_fit_time std_fit_time mean_score_time std_score_time param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
4 0.018128 0.002229 0.004792 0.000173 10 l1 {'logisticregression__C': 10, 'logisticregress... 0.821229 0.814607 0.814607 0.792135 0.848315 0.818178 0.018007 1
2 0.013442 0.000392 0.004720 0.000045 1 l1 {'logisticregression__C': 1, 'logisticregressi... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 2
3 0.012881 0.000346 0.004768 0.000058 1 l2 {'logisticregression__C': 1, 'logisticregressi... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 3
5 0.013615 0.000414 0.004737 0.000087 10 l2 {'logisticregression__C': 10, 'logisticregress... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 4
1 0.012467 0.000274 0.004867 0.000117 0.1 l2 {'logisticregression__C': 0.1, 'logisticregres... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 5
0 0.013179 0.001090 0.006131 0.001403 0.1 l1 {'logisticregression__C': 0.1, 'logisticregres... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 6

In order to tune the transformers, we need to know their names:

In [54]:
pipe.named_steps.columntransformer.named_transformers_
Out[54]:
{'pipeline': Pipeline(memory=None,
          steps=[('simpleimputer',
                  SimpleImputer(add_indicator=False, copy=True,
                                fill_value='missing', missing_values=nan,
                                strategy='constant', verbose=0)),
                 ('onehotencoder',
                  OneHotEncoder(categories='auto', drop=None,
                                dtype=<class 'numpy.float64'>,
                                handle_unknown='error', sparse=True))],
          verbose=False),
 'countvectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                 lowercase=True, max_df=1.0, max_features=None, min_df=1,
                 ngram_range=(1, 1), preprocessor=None, stop_words=None,
                 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=None, vocabulary=None),
 'simpleimputer': SimpleImputer(add_indicator=False, copy=True, fill_value=None,
               missing_values=nan, strategy='mean', verbose=0),
 'remainder': 'passthrough'}

Tune the "drop" hyperparameter of OneHotEncoder by adding it to the params dictionary:

  • Pipeline step: "columntransformer"
  • First transformer: "pipeline"
  • Second step of the inner pipeline: "onehotencoder"
  • Hyperparameter: "drop"
  • Separate each of these components by 2 underscores

Try the values None and 'first':

  • None is the default
  • 'first' means drop the first category of each feature after encoding (new in version 0.21)
In [55]:
params['columntransformer__pipeline__onehotencoder__drop'] = [None, 'first']

Tune the "ngram_range" hyperparameter of CountVectorizer:

  • Pipeline step: "columntransformer"
  • Second transformer: "countvectorizer"
  • Hyperparameter: "ngram_range" (note the single underscore)

Try the values (1, 1) and (1, 2):

  • (1, 1) is the default, which creates a single feature from each word
  • (1, 2) creates features from both single words and word pairs
In [56]:
params['columntransformer__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]

Tune the "add_indicator" hyperparameter of SimpleImputer:

  • Pipeline step: "columntransformer"
  • Third transformer: "simpleimputer"
  • Hyperparameter: "add_indicator" (note the single underscore)

Try the values False and True:

  • False is the default
  • True means add a binary indicator matrix (new in version 0.21)
In [57]:
params['columntransformer__simpleimputer__add_indicator'] = [False, True]

Examine the params dictionary for any typos:

In [58]:
params
Out[58]:
{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10],
 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True]}

Perform the grid search again:

  • There are 48 combinations to try, so it takes 8 times longer than the previous search
In [59]:
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y);

Sort and review the search results:

  • Accuracy of the best model is an improvement over the previous grid search
  • It's hard to pick out trends for each hyperparameter because many of them affect one another
In [60]:
results = pd.DataFrame(grid.cv_results_)
results.sort_values('rank_test_score')
Out[60]:
mean_fit_time std_fit_time mean_score_time std_score_time param_columntransformer__countvectorizer__ngram_range param_columntransformer__pipeline__onehotencoder__drop param_columntransformer__simpleimputer__add_indicator param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
28 0.023061 0.002269 0.005146 0.000024 (1, 2) None False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.860335 0.820225 0.820225 0.786517 0.859551 0.829370 0.027833 1
46 0.029656 0.003894 0.005461 0.000197 (1, 2) first True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.849162 0.831461 0.820225 0.786517 0.853933 0.828259 0.024138 2
40 0.030675 0.002119 0.005186 0.000081 (1, 2) first False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.854749 0.825843 0.814607 0.786517 0.848315 0.826006 0.024549 3
34 0.023133 0.001805 0.005422 0.000201 (1, 2) None True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.849162 0.820225 0.820225 0.780899 0.853933 0.824889 0.026120 4
10 0.020138 0.002229 0.005470 0.000890 (1, 1) None True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.826816 0.814607 0.820225 0.780899 0.853933 0.819296 0.023467 5
22 0.021249 0.001699 0.004953 0.000112 (1, 1) first True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.821229 0.803371 0.825843 0.780899 0.859551 0.818178 0.026034 6
4 0.018240 0.001837 0.004744 0.000091 (1, 1) None False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.821229 0.814607 0.814607 0.792135 0.848315 0.818178 0.018007 6
20 0.014477 0.001047 0.005164 0.000368 (1, 1) first True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.810056 0.820225 0.797753 0.792135 0.853933 0.814820 0.021852 8
2 0.013728 0.000415 0.004839 0.000075 (1, 1) None False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 9
16 0.021138 0.001391 0.004800 0.000139 (1, 1) first False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.821229 0.803371 0.814607 0.780899 0.853933 0.814808 0.023886 10
44 0.018747 0.001117 0.005938 0.000527 (1, 2) first True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.853933 0.813703 0.022207 11
47 0.018135 0.000447 0.005382 0.000114 (1, 2) first True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.820225 0.820225 0.780899 0.853933 0.812598 0.026265 12
8 0.013765 0.000456 0.004881 0.000127 (1, 1) None True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.786517 0.792135 0.859551 0.812579 0.026183 13
14 0.013688 0.000971 0.004796 0.000181 (1, 1) first False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.848315 0.812579 0.020194 14
38 0.017634 0.000525 0.005225 0.000081 (1, 2) first False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.848315 0.812579 0.020194 14
11 0.014208 0.000597 0.005329 0.000715 (1, 1) None True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.792135 0.870787 0.811481 0.031065 16
21 0.013204 0.000708 0.004958 0.000345 (1, 1) first True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.820225 0.803371 0.786517 0.853933 0.811468 0.024076 17
3 0.013157 0.000342 0.004966 0.000461 (1, 1) None False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 18
26 0.017373 0.000133 0.005122 0.000050 (1, 2) None False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.810056 0.820225 0.786517 0.792135 0.848315 0.811449 0.022058 19
23 0.013665 0.000259 0.004913 0.000128 (1, 1) first True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.776536 0.803371 0.808989 0.792135 0.870787 0.810363 0.032182 20
9 0.012987 0.000245 0.004787 0.000074 (1, 1) None True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.825843 0.797753 0.786517 0.848315 0.810345 0.023233 21
15 0.012510 0.000078 0.004724 0.000065 (1, 1) first False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.803371 0.786517 0.837079 0.810332 0.017107 22
32 0.017513 0.000521 0.005245 0.000034 (1, 2) None True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.780899 0.792135 0.853933 0.810332 0.025419 22
17 0.013241 0.000162 0.004707 0.000095 (1, 1) first False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 24
35 0.018231 0.000590 0.005368 0.000091 (1, 2) None True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.820225 0.814607 0.780899 0.848315 0.809234 0.025357 24
5 0.013436 0.000172 0.004653 0.000031 (1, 1) None False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 24
29 0.023017 0.011150 0.005115 0.000026 (1, 2) None False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.814607 0.820225 0.780899 0.837079 0.808104 0.020904 27
45 0.017329 0.000598 0.005484 0.000115 (1, 2) first True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.814607 0.797753 0.786517 0.848315 0.808097 0.022143 28
41 0.017454 0.000328 0.005192 0.000138 (1, 2) first False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.814607 0.820225 0.780899 0.831461 0.806980 0.019414 29
39 0.016762 0.000371 0.005216 0.000141 (1, 2) first False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.808989 0.797753 0.786517 0.837079 0.805844 0.017164 30
27 0.016690 0.000149 0.005101 0.000031 (1, 2) None False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.814607 0.792135 0.786517 0.837079 0.805844 0.018234 30
33 0.016940 0.000164 0.005267 0.000074 (1, 2) None True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.814607 0.792135 0.786517 0.848315 0.804739 0.024489 32
31 0.016125 0.000202 0.005330 0.000113 (1, 2) None True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.803371 0.769663 0.786517 0.814607 0.794608 0.015380 33
7 0.012880 0.001058 0.005017 0.000315 (1, 1) None True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.803371 0.764045 0.786517 0.814607 0.793484 0.017253 34
19 0.012406 0.000379 0.004833 0.000086 (1, 1) first True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.803371 0.764045 0.780899 0.814607 0.791243 0.017572 35
43 0.016018 0.000072 0.005258 0.000042 (1, 2) first True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.797753 0.764045 0.780899 0.808989 0.790114 0.015849 36
37 0.016297 0.001262 0.005391 0.000457 (1, 2) first False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.764045 0.780899 0.808989 0.789003 0.016100 37
25 0.015791 0.000137 0.005094 0.000033 (1, 2) None False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.803371 0.764045 0.775281 0.808989 0.788996 0.016944 38
1 0.012500 0.000984 0.004949 0.000352 (1, 1) None False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 39
13 0.011891 0.000116 0.004815 0.000203 (1, 1) first False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.764045 0.780899 0.808989 0.787885 0.016343 40
0 0.014173 0.001353 0.005162 0.000298 (1, 1) None False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
24 0.015586 0.000120 0.005175 0.000132 (1, 2) None False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
6 0.012031 0.000139 0.004970 0.000330 (1, 1) None True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
30 0.016099 0.000485 0.005302 0.000050 (1, 2) None True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.769663 0.758427 0.797753 0.782267 0.016807 44
36 0.015749 0.000290 0.005128 0.000038 (1, 2) first False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
42 0.016111 0.000190 0.005278 0.000020 (1, 2) first True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
12 0.012544 0.000763 0.004768 0.000070 (1, 1) first False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
18 0.012307 0.000567 0.004854 0.000064 (1, 1) first True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45

Access the single best score and best set of hyperparameters:

  • Two of the hyperparameters used the default values (drop, add_indicator)
  • Three of the hyperparameters did not use the default values (ngram_range, C, penalty)
In [61]:
grid.best_score_
Out[61]:
0.8293704098926622
In [62]:
grid.best_params_
Out[62]:
{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': False,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1'}

You can use the GridSearchCV object to make predictions:

  • When it was fit, it automatically refit the Pipeline on all of the data (X and y) using the best set of hyperparameters, and it uses that refit Pipeline to make predictions
In [63]:
grid.predict(X_new)
Out[63]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
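
As a small aside (not in the original notebook; "best_pipe" is a name invented here), the refit Pipeline is also available directly via the best_estimator_ attribute, which is handy if you want to inspect it or save it to disk:

best_pipe = grid.best_estimator_
best_pipe.predict(X_new)          # same predictions as grid.predict(X_new)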