Office Hours session 2¶

Starter code (copy from here: http://bit.ly/complex-pipeline)¶

In [1]:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [2]:

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']

In [3]:

df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']

In [4]:

df_new = pd.read_csv('http://bit.ly/kaggletest')
X_new = df_new[cols]

In [5]:

imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()

In [6]:

imp_ohe = make_pipeline(imp_constant, ohe)
vect = CountVectorizer()
imp = SimpleImputer()

In [7]:

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

In [8]:

logreg = LogisticRegression(solver='liblinear', random_state=1)

In [9]:

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

Out[9]:

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

K: Why did you create the `imp_ohe` pipeline? Why didn't you instead add `imp_constant` to the ColumnTransformer?¶

Here's a reminder of what imp_ohe contains:

In [10]:

imp_ohe = make_pipeline(imp_constant, ohe)

Here's how I used it in the ColumnTransformer:

In [11]:

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

Many people suggested that I use something like this instead (which will not work):

In [12]:

ct_suggestion = make_column_transformer(
    (imp_constant, ['Embarked', 'Sex']),
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

I'll create the 10 row dataset to help me explain why:

In [13]:

df_tiny = df.head(10).copy()
df_tiny

Out[13]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

In [14]:

X_tiny = df_tiny[cols]
X_tiny

Out[14]:

	Parch	Fare	Embarked	Sex	Name	Age
0	0	7.2500	S	male	Braund, Mr. Owen Harris	22.0
1	0	71.2833	C	female	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0
2	0	7.9250	S	female	Heikkinen, Miss. Laina	26.0
3	0	53.1000	S	female	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0
4	0	8.0500	S	male	Allen, Mr. William Henry	35.0
5	0	8.4583	Q	male	Moran, Mr. James	NaN
6	0	51.8625	S	male	McCarthy, Mr. Timothy J	54.0
7	1	21.0750	S	male	Palsson, Master. Gosta Leonard	2.0
8	2	11.1333	S	female	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	27.0
9	0	30.0708	C	female	Nasser, Mrs. Nicholas (Adele Achem)	14.0

Here's a smaller ColumnTransformer that uses imp_ohe, but only on "Embarked":

There are no missing values, so the first step of the imp_ohe Pipeline passes it along to the second step
The second step of the imp_ohe Pipeline does the one-hot encoding and outputs three columns
This is exactly what we want

In [15]:

make_column_transformer(
    (imp_ohe, ['Embarked']),
    remainder='drop').fit_transform(X_tiny)

Out[15]:

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])

Try splitting imp_ohe into two separate transformers:

imp_constant does the imputation, and ohe does the one-hot encoding
The results get stacked side-by-side, which is not what we want

In [16]:

make_column_transformer(
    (imp_constant, ['Embarked']),
    (ohe, ['Embarked']),
    remainder='drop').fit_transform(X_tiny)

Out[16]:

array([['S', 0.0, 0.0, 1.0],
       ['C', 1.0, 0.0, 0.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['Q', 0.0, 1.0, 0.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['C', 1.0, 0.0, 0.0]], dtype=object)

Try reversing the order of the transformers:

All this does is change the order of the output columns, which is still not what we want

In [17]:

make_column_transformer(
    (ohe, ['Embarked']),
    (imp_constant, ['Embarked']),
    remainder='drop').fit_transform(X_tiny)

Out[17]:

array([[0.0, 0.0, 1.0, 'S'],
       [1.0, 0.0, 0.0, 'C'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 1.0, 0.0, 'Q'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 0.0, 1.0, 'S'],
       [1.0, 0.0, 0.0, 'C']], dtype=object)

Key ideas:

Pipeline operates sequentially
- It has steps: the output of step 1 is the input to step 2
- This applies to our larger Pipeline (pipe) and our smaller Pipeline (imp_ohe)
ColumnTransformer operates in parallel
- It does not have steps: data does not flow from one transformer to the next
- It splits up the columns, transforms them independently, and combines the results side-by-side

Take one more look at the correct ColumnTransformer (which matches the diagram above):

In [18]:

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

In [19]:

ct.fit_transform(X)

Out[19]:

<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

Compare that to the ColumnTransformer many people were suggesting:

The first transformer (imp_constant) operates completely independently of the second transformer (ohe)
imp_constant doesn't pass transformed data down to the second transformer

In [20]:

ct_suggestion = make_column_transformer(
    (imp_constant, ['Embarked', 'Sex']),
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

It will error during fit_transform because the ohe transformer doesn't know how to handle missing values:

In [21]:

# ct_suggestion.fit_transform(X)

Justin: What's the cost versus the benefit of adding so many columns for Name?¶

Cross-validate our pipeline:

Use IPython's "time" magic function to time how long the operation takes
This accuracy score is our baseline score
1509 of the 1518 features were created by the "Name" column

In [22]:

from sklearn.model_selection import cross_val_score
%time cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

CPU times: user 112 ms, sys: 2.32 ms, total: 114 ms
Wall time: 114 ms

Out[22]:

0.8114619295712762

Remove "Name" from the ColumnTransformer:

Using this notation is an easy way to drop some columns ("Name") and passthrough other columns ("Parch")
This results in only 9 features

In [23]:

ct_no_name = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    ('drop', 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

Cross-validate the pipeline that doesn't include "Name":

In [24]:

no_name = make_pipeline(ct_no_name, logreg)
%time cross_val_score(no_name, X, y, cv=5, scoring='accuracy').mean()

CPU times: user 65.8 ms, sys: 1.23 ms, total: 67 ms
Wall time: 68.1 ms

Out[24]:

0.7833908731404181

Create a pipeline that only includes "Name" and cross-validate it:

Since we're only using one column, there's no need for a ColumnTransformer
Pass the "Name" Series (rather than the entire X) to cross_val_score
This results in 1509 features

In [25]:

only_name = make_pipeline(vect, logreg)
%time cross_val_score(only_name, X['Name'], y, cv=5, scoring='accuracy').mean()

CPU times: user 44 ms, sys: 1.4 ms, total: 45.4 ms
Wall time: 45.6 ms

Out[25]:

0.7945954428472788

What were the results?

The best results came from using all of our features
only_name had a slightly higher accuracy than no_name
only_name ran slightly faster than no_name (probably because there's overhead from the ColumnTransformer)

What are the benefits of including "Name"?

Better accuracy, which tells us that the "Name" column contains more signal than noise (with respect to the target)

What are the costs of including "Name"?

Less interpretability, since it's impractical to examine the coefficients of 1518 features

Other thoughts:

Including "Name" in this model is not resulting in overfitting, since it's increasing the cross-validated accuracy
- As demonstrated in this example, having more features (1518) than observations (891) does not automatically result in overfitting
It's possible that including "Name" would make some models worse, but you shouldn't assume that without trying it

Anton: What's the target accuracy that you are trying to reach?¶

For most real problems, it's impossible to know how accurate your model could be if you did enough tuning and tried enough models
It's also impossible to know how accurate your model could be if you got more observations or more features
Thus in most cases, you don't have a target accuracy, instead you just stop once you run out of time or money or ideas
The main exception is if you are working on a well-studied research problem, because in that case there may be a state-of-the-art benchmark that everyone is trying to surpass

Motasem: When using cross_val_score, is the imputation value calculated separately for each fold?¶

Here's a reminder of what our ColumnTransformer and Pipeline look like:

In [26]:

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

In [27]:

pipe = make_pipeline(ct, logreg)

Here's a reminder of how we cross-validate the Pipeline:

In [28]:

cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

Out[28]:

0.8114619295712762

What happens "under the hood" when we run cross_val_score?

Step 1: cross_val_score splits the data, such that 80% is set aside for training and the remaining 20% is set aside for testing
Step 2: pipe.fit is run on the training portion, which means that the transformations are performed on the training portion and the transformed data is used to fit the model
Step 3: pipe.predict is run on the testing portion, which means that the transformations are applied to the testing portion and the transformed data is passed to the model for predictions
Step 4: The accuracy of the predictions is calculated
Steps 1 through 4 are repeated 4 more times, and each time a different 20% is set aside as the testing portion

The key point is that cross_val_score does the transformations (steps 2 and 3) after splitting the data (step 1):

Thus, the imputation values for "Age" and "Fare" are calculated 5 different times
- Each time, these values are calculated using the training portion, and applied to both the training and testing portions
- This is done specifically to avoid data leakage, which is when you inadvertently include knowledge from the testing data when training a model

Anusha: Why would missing value imputation in pandas lead to data leakage?¶

Your model evaluation procedure (cross-validation in this case) is supposed to simulate the future, so that you can accurately estimate right now how well your model will perform on new data
If you impute missing values on your whole dataset in pandas and then pass your dataset to scikit-learn, your model evaluation procedure will no longer be an accurate simulation of reality
- This is because the imputation values are based on your entire dataset, rather than just the training portion of your dataset
- Keep in mind that the "training portion" will change 5 times during 5-fold cross-validation, thus it's impractical to avoid data leakage if you use pandas for imputation

Hause: Regarding cross_val_score, what algorithm does scikit-learn use to split the data into different folds? Can you examine the data used in each fold?¶

When using cross_val_score, I've been passing an integer that specifies the number of folds:

In [29]:

cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

Out[29]:

array([0.79888268, 0.8258427 , 0.80337079, 0.78651685, 0.84269663])

Here's what happens "under the hood" when you specify 5 folds for a classification problem:

In [30]:

from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(5)
cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')

Out[30]:

array([0.79888268, 0.8258427 , 0.80337079, 0.78651685, 0.84269663])

StratifiedKFold is a cross-validation splitter, meaning that its role is to split datasets:

You pass the splitter to cross_val_score instead of an integer
"Stratified" means that it uses stratified sampling to ensure that the class frequencies are approximately equal in each split
- Example: 38% of the y values are 1 (which means survived), so every time it splits, it ensures that about 38% of the training portion is survived passengers and about 38% of the testing portion is survived passengers

You can examine the data used in each fold. Here are the training and testing indices used by the first split:

In [31]:

list(kf.split(X, y))[0]

Out[31]:

(array([168, 169, 170, 171, 173, 174, 175, 176, 177, 178, 179, 180, 181,
        182, 185, 188, 189, 191, 196, 197, 199, 200, 201, 202, 203, 204,
        205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217,
        218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230,
        231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243,
        244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256,
        257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
        270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282,
        283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295,
        296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308,
        309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321,
        322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334,
        335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347,
        348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360,
        361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373,
        374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386,
        387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399,
        400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412,
        413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425,
        426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438,
        439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451,
        452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464,
        465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477,
        478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490,
        491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503,
        504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516,
        517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529,
        530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542,
        543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555,
        556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568,
        569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581,
        582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594,
        595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607,
        608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620,
        621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633,
        634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646,
        647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659,
        660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672,
        673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685,
        686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698,
        699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711,
        712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724,
        725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737,
        738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750,
        751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763,
        764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776,
        777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789,
        790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802,
        803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815,
        816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828,
        829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841,
        842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854,
        855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867,
        868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880,
        881, 882, 883, 884, 885, 886, 887, 888, 889, 890]),
 array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
         13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
         26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
         39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
         52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
         65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
         78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
         91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
        104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
        117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
        130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
        143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
        156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 172,
        183, 184, 186, 187, 190, 192, 193, 194, 195, 198]))

By default, StratifiedKFold does not shuffle the rows:

There is no randomness in this process, thus you will always get the same results every time
This is why cross_val_score and GridSearchCV don't have a random_state parameter

If the order of your rows is not arbitrary, then you can (and should) shuffle the rows by modifying the splitter:

In [32]:

kf = StratifiedKFold(5, shuffle=True, random_state=0)

When we use grid search, we're choosing hyperparameters that maximize the cross-validation score on this dataset:

Thus, we're using the same data to do two separate jobs: tune the model hyperparameters and estimate its future performance
This biases the model to this dataset and can result in overly optimistic scores

If you want a more realistic performance estimate and you have enough data to spare:

Split the training data into two sets: the first set is passed to grid search, and second set is used to evaluate the best model chosen by grid search
The score on the second set is a more realistic estimate of model performance because the model wasn't tuned to that set

If you want a more realistic performance estimate and you don't have enough data to spare:

Use nested cross-validation, which has an outer loop and an inner loop:
- Inner loop does the hyperparameter tuning using grid search
- Outer loop estimates model performance using cross-validation

Bottom line:

If you want the most realistic estimate of a model's performance, then you have to add additional complexity to your process
If you only care about choosing the best hyperparameters, then you don't need to add this complexity

Recommended resources:

JV: What is the difference between FeatureUnion and ColumnTransformer?¶

As a reminder, you can add a "missing indicator" to SimpleImputer (new in version 0.21):

In [33]:

SimpleImputer(add_indicator=True).fit_transform(X_tiny[['Age']])

Out[33]:

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

How could we create the same output without using the "add_indicator" parameter?

We can create the left column using the SimpleImputer:

In [34]:

imp.fit_transform(X_tiny[['Age']])

Out[34]:

array([[22.        ],
       [38.        ],
       [26.        ],
       [35.        ],
       [35.        ],
       [28.11111111],
       [54.        ],
       [ 2.        ],
       [27.        ],
       [14.        ]])

We can create the right column using the MissingIndicator class:

It outputs False and True instead of 0 and 1, but otherwise the results are the same
This is available in version 0.20

In [35]:

from sklearn.impute import MissingIndicator
indicator = MissingIndicator()
indicator.fit_transform(X_tiny[['Age']])

Out[35]:

array([[False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False]])

We can use FeatureUnion to stack these two columns side-by-side:

Use make_union to create a FeatureUnion of SimpleImputer and MissingIndicator
False and True are converted to 0 and 1 when put in a numeric array

In [36]:

from sklearn.pipeline import make_union
imp_indicator = make_union(imp, indicator)
imp_indicator.fit_transform(X_tiny[['Age']])

Out[36]:

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

Comparing FeatureUnion and ColumnTransformer:

FeatureUnion applies multiple transformations to a single input column and stacks the results side-by-side
ColumnTransformer applies a different transformation to each input column and stacks the results side-by-side

Thus we could include our FeatureUnion in a ColumnTransformer:

In [37]:

make_column_transformer(
    (imp_indicator, ['Age']),
    remainder='drop').fit_transform(X_tiny)

Out[37]:

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

Or we could achieve the same results without the FeatureUnion, by passing the "Age" column to the ColumnTransformer twice:

In [38]:

make_column_transformer(
    (imp, ['Age']),
    (indicator, ['Age']),
    remainder='drop').fit_transform(X_tiny)

Out[38]:

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

Conclusion:

In both FeatureUnion and ColumnTransformer, data does not flow from one transformer to the next
- Instead, the transformers are applied independently (in parallel)
- This is different from a Pipeline, in which data does flow from one step to the next
FeatureUnion was a precursor to ColumnTransformer
- FeatureUnion is far less useful now that ColumnTransformer exists

Gloria: Can you talk about the other two imputers in scikit-learn?¶

IterativeImputer (new in version 0.21) is experimental, meaning the API and predictions may change:

In [39]:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

Pass it "Parch" (no missing values), "Fare" (no missing values), and "Age" (1 missing value):

In [40]:

imp_iterative = IterativeImputer()
imp_iterative.fit_transform(X_tiny[['Parch', 'Fare', 'Age']])

Out[40]:

array([[ 0.        ,  7.25      , 22.        ],
       [ 0.        , 71.2833    , 38.        ],
       [ 0.        ,  7.925     , 26.        ],
       [ 0.        , 53.1       , 35.        ],
       [ 0.        ,  8.05      , 35.        ],
       [ 0.        ,  8.4583    , 24.23702669],
       [ 0.        , 51.8625    , 54.        ],
       [ 1.        , 21.075     ,  2.        ],
       [ 2.        , 11.1333    , 27.        ],
       [ 0.        , 30.0708    , 14.        ]])

How it works:

For the 9 rows in which "Age" was not missing:
- scikit-learn trains a regression model in which "Parch" and "Fare" are the features and "Age" is the target
For the 1 row in which "Age" was missing:
- scikit-learn passes the "Parch" and "Fare" values to the trained model
- The model makes a prediction for the "Age" value, and that value is used for imputation
In summary: IterativeImputer turned this into a regression problem with 2 features, 9 observations in the training set, and 1 observation in the testing set

Notes:

Unlike SimpleImputer, you have to pass it multiple numeric columns, otherwise it can't do the regression problem
- Thus you're deciding what features to use in this regression problem
You can pass it multiple features with missing values
- Meaning: "Parch", "Fare", and "Age" can all have missing values
- IterativeImputer will just do 3 different regression problems, and each column will have a turn being the target
You can choose which regression model IterativeImputer uses

KNNImputer (new in version 0.22) is another option:

In [41]:

from sklearn.impute import KNNImputer
imp_knn = KNNImputer(n_neighbors=2)
imp_knn.fit_transform(X_tiny[['Parch', 'Fare', 'Age']])

Out[41]:

array([[ 0.    ,  7.25  , 22.    ],
       [ 0.    , 71.2833, 38.    ],
       [ 0.    ,  7.925 , 26.    ],
       [ 0.    , 53.1   , 35.    ],
       [ 0.    ,  8.05  , 35.    ],
       [ 0.    ,  8.4583, 30.5   ],
       [ 0.    , 51.8625, 54.    ],
       [ 1.    , 21.075 ,  2.    ],
       [ 2.    , 11.1333, 27.    ],
       [ 0.    , 30.0708, 14.    ]])

How it works:

For the 1 row in which "Age" was missing:
- Because I set n_neighbors=2, scikit-learn finds the 2 "nearest" rows to this row, which is measured by how close the "Parch" and "Fare" values are to this row
- scikit-learn averages the "Age" values for those 2 rows (26 and 35), and that value (30.5) is used as the imputation value

The intuition behind both of these imputers is that it can be useful to take other features into account when deciding what value to impute:

Example: Maybe a high "Parch" and a low "Fare" is common for kids
Thus if "Age" is missing for a row which has a high "Parch" and a low "Fare", then you should impute a low "Age" rather than then mean of "Age", which is what SimpleImputer would do

Elin: How would you add feature selection to our Pipeline?¶

It can be useful to add feature selection to your workflow:

Model accuracy can be improved by removing irrelevant features (ones that are adding "noise", not "signal")
We can automate this process by including it in our Pipeline
I will demonstrate two of the many ways you can do this in scikit-learn: SelectPercentile and SelectFromModel

Cross-validate our pipeline (without any hyperparameter tuning) to generate a "baseline" accuracy:

In [42]:

cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

Out[42]:

0.8114619295712762

SelectPercentile selects features based on statistical tests:

You specify a statistical test
It scores all features using that test
It keeps a percentage (that you specify) of the best scoring features

Create a feature selection object:

Pass it the statistical test
- We're using chi2, but other tests are available
Pass it the percentile
- We're arbitrarily using 50 to keep 50% of the features (10 would keep 10% percent of the features)

In [43]:

from sklearn.feature_selection import SelectPercentile, chi2
selection = SelectPercentile(chi2, percentile=50)

Add feature selection to the Pipeline after the ColumnTransformer but before the model:

In [44]:

pipe_selection = make_pipeline(ct, selection, logreg)

Cross-validate the updated Pipeline, and the score has improved:

In [45]:

cross_val_score(pipe_selection, X, y, cv=5, scoring='accuracy').mean()

Out[45]:

0.8193019898311469

SelectFromModel scores features using a model:

You specify a model
- Options include logistic regression, linear SVC, or any tree-based model (including ensembles)
It uses the coef_ or feature_importances_ attribute of that model as the score

We'll try using logistic regression for feature selection:

Create a new instance to use for feature selection
This is completely separate from the logistic regression model we are using to make predictions

In [46]:

logreg_selection = LogisticRegression(solver='liblinear', penalty='l1', random_state=1)

Create a feature selection object:

Pass it the model we're using for selection
Pass it the threshold
- This can be the mean or median of the scores, though you can optionally include a scaling factor (such as 1.5 times mean)
- Higher threshold means fewer features will be kept

In [47]:

from sklearn.feature_selection import SelectFromModel
selection = SelectFromModel(logreg_selection, threshold='mean')

Update the Pipeline to use the new feature selection object:

In [48]:

pipe_selection = make_pipeline(ct, selection, logreg)

Cross-validate the updated Pipeline, and the score has improved again:

In [49]:

cross_val_score(pipe_selection, X, y, cv=5, scoring='accuracy').mean()

Out[49]:

0.8260121775155358

Both of these approaches should be further tuned:

With SelectPercentile you should tune the percentile, and with SelectFromModel you should tune the threshold
You should also tune the transformer and model hyperparameters at the same time (using GridSearchCV), because you don't know which combination will produce the best results

Khaled: How would you add feature standardization to our Pipeline?¶

Some (but not all) Machine Learning models benefit from feature standardization, and this is often done with StandardScaler:

In [50]:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

Here's a reminder of our existing ColumnTransformer:

In [51]:

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

If we wanted to scale "Age" and "Fare", we could make a Pipeline of imputation and scaling:

In [52]:

imp_scaler = make_pipeline(imp, scaler)

Then replace imp with imp_scaler in our ColumnTransformer:

In [53]:

ct_scaler = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_scaler, ['Age', 'Fare']),
    remainder='passthrough')

Update the Pipeline:

In [54]:

pipe_scaler = make_pipeline(ct_scaler, logreg)

Cross-validated accuracy has decreased slightly:

Don't assume that scaling is always needed
Our particular logisitic regression solver (liblinear) is robust to unscaled data

In [55]:

cross_val_score(pipe_scaler, X, y, cv=5, scoring='accuracy').mean()

Out[55]:

0.8092210156299039

An alternative way to include scaling is to scale all columns:

StandardScaler will destroy the sparsity in a sparse matrix, so use MaxAbsScaler instead

In [56]:

from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()

Update the Pipeline:

Use the original ColumnTransformer (does not include scaling)
MaxAbsScaler is the second step

In [57]:

pipe_scaler = make_pipeline(ct, scaler, logreg)

Cross-validated accuracy is basically the same as our baseline (which did not include scaling):

In [58]:

cross_val_score(pipe_scaler, X, y, cv=5, scoring='accuracy').mean()

Out[58]:

0.8114556525014123

Hussain: Should we scale all features, or only the features that were originally numerical?¶

We tried both approaches above:

First, I only scaled the features that were originally numerical
Second, I scaled everything that came out of the ColumnTransformer
- MaxAbsScaler has zero effect on columns output by OneHotEncoder and only a tiny effect on columns output by CountVectorizer, so the second approach is mostly just affecting the numerical columns anyway

I suggest trying both approaches, and see which one works better:

It's not clear to me if one approach is more theoretically sound than the other

Gaurav: How would you add outlier handling to our Pipeline?¶

Here are two approaches you can try:

Use a scaler that is robust to outliers (such as RobustScaler in scikit-learn)
- You can add that to a Pipeline
Identify and remove outliers from the dataset
- There are multiple outlier detection functions in scikit-learn
- But, scikit-learn doesn't currently support transformers that remove rows, which is what you would need in order to include outlier removal in a Pipeline

Feel free to email me if you have other suggestions for outlier handling!

Leona: How would you adapt this Pipeline to use a different model, such as a RandomForestClassifier?¶

All that is required is switching out the final step of the Pipeline to use a different model:

Each model has different hyperparameters you can tune, and often scikit-learn's User Guide will suggest which ones you should tune
You should still tune all Pipeline steps simultaneously, because different models will benefit from different data transformations and different features

DS: How can I include custom transformations for feature engineering within a Pipeline?¶

Here's a reminder of what's in our 10 row dataset, so that we can plan out a few custom transformations:

In [59]:

df_tiny

Out[59]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

Let's pretend we believe that "Age" and "Fare" might be better features if we floor them (meaning round them down):

This is not possible in scikit-learn, but it is possible using NumPy's floor function
- NumPy functions are a good choice for custom transformations, because pandas and scikit-learn both work well with NumPy
Notice that we pass it a 2D object and it returns a 2D object
- This will turn out to be a useful characteristic for custom transformations

In [60]:

import numpy as np
np.floor(df_tiny[['Age', 'Fare']])

Out[60]:

	Age	Fare
0	22.0	7.0
1	38.0	71.0
2	26.0	7.0
3	35.0	53.0
4	35.0	8.0
5	NaN	8.0
6	54.0	51.0
7	2.0	21.0
8	27.0	11.0
9	14.0	30.0

In order to do this transformation in scikit-learn, we need to convert the floor function into a scikit-learn transformer using FunctionTransformer:

In [61]:

from sklearn.preprocessing import FunctionTransformer

Pass the floor function to FunctionTransformer and it returns a scikit-learn transformer:

Prior to version 0.22, you should also include the argument validate=False

In [62]:

get_floor = FunctionTransformer(np.floor)

Because get_floor is a transformer, you can use the fit_transform method to perform transformations:

In [63]:

get_floor.fit_transform(df_tiny[['Age', 'Fare']])

Out[63]:

	Age	Fare
0	22.0	7.0
1	38.0	71.0
2	26.0	7.0
3	35.0	53.0
4	35.0	8.0
5	NaN	8.0
6	54.0	51.0
7	2.0	21.0
8	27.0	11.0
9	14.0	30.0

get_floor can be included in a ColumnTransformer:

In [64]:

make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    remainder='drop').fit_transform(df_tiny)

Out[64]:

array([[22.,  7.],
       [38., 71.],
       [26.,  7.],
       [35., 53.],
       [35.,  8.],
       [nan,  8.],
       [54., 51.],
       [ 2., 21.],
       [27., 11.],
       [14., 30.]])

Let's plan out a second custom transformation:

All of the cabins start with a letter, and we believe that the letter indicates the deck they were staying on, which might be a predictive feature

In [65]:

df_tiny

Out[65]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

Use the pandas string slice method to extract the first letter:

Use the apply method with a lambda so that it can operate on multiple columns
This will enable our function to accept 2D input and return 2D output

In [66]:

df_tiny[['Cabin']].apply(lambda x: x.str.slice(0, 1))

Out[66]:

	Cabin
0	NaN
1	C
2	NaN
3	C
4	NaN
5	NaN
6	E
7	NaN
8	NaN
9	NaN

Convert this operation into a custom function:

You need to ensure that your function works whether the input is a DataFrame or a NumPy array
- ColumnTransformers accept both DataFrames and NumPy arrays
- Pipeline steps output NumPy arrays, which could then become the input to this function
Thus we convert the input to a DataFrame explicitly, which ensures that we will be able to use the pandas apply method

In [67]:

def first_letter(df):
    return pd.DataFrame(df).apply(lambda x: x.str.slice(0, 1))

Convert the function to a transformer:

In [68]:

get_first_letter = FunctionTransformer(first_letter)

Add this transformer to the ColumnTransformer:

In [69]:

make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    (get_first_letter, ['Cabin']),
    remainder='drop').fit_transform(df_tiny)

Out[69]:

array([[22.0, 7.0, nan],
       [38.0, 71.0, 'C'],
       [26.0, 7.0, nan],
       [35.0, 53.0, 'C'],
       [35.0, 8.0, nan],
       [nan, 8.0, nan],
       [54.0, 51.0, 'E'],
       [2.0, 21.0, nan],
       [27.0, 11.0, nan],
       [14.0, 30.0, nan]], dtype=object)

Two shape considerations to keep in mind when writing functions that will be used in a ColumnTransformer:

Your function isn't required to accept 2D input, but it's better if it accepts 2D input
- This enables your function (once transformed) to accept multiple columns in the ColumnTransformer
Your function is required to return 2D output in order to be used in a ColumnTransformer:
- If your function returns a pandas object, it should return a DataFrame (not a Series)
- If your function returns a 1D array, reshape it (within the function) to be a 2D array

Let's plan out a third custom transformation:

We believe that a passenger's total number of family members aboard ("SibSp" + "Parch") is more predictive than either feature individually

In [70]:

df_tiny

Out[70]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

Use the sum method over the columns axis:

Notice that it outputs a 1D object

In [71]:

df_tiny[['SibSp', 'Parch']].sum(axis=1)

Out[71]:

0    1
1    1
2    0
3    1
4    0
5    0
6    0
7    4
8    2
9    1
dtype: int64

Convert this operation into a custom function:

Convert the input to a NumPy array explicitly, which ensures that we will be able to use the sum and reshape methods
Because the sum outputs a 1D object, use reshape to convert it to a 2D object
- This notation specifies that the second dimension should be 1 and the first dimension should be inferred

In [72]:

def sum_cols(df):
    return np.array(df).sum(axis=1).reshape(-1, 1)

Confirm that sum_cols returns 2D output:

In [73]:

sum_cols(df_tiny[['SibSp', 'Parch']])

Out[73]:

array([[1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [4],
       [2],
       [1]])

Convert the function to a transformer and add it to the ColumnTransformer:

In [74]:

get_sum = FunctionTransformer(sum_cols)

In [75]:

make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    (get_first_letter, ['Cabin']),
    (get_sum, ['SibSp', 'Parch']),
    remainder='drop').fit_transform(df_tiny)

Out[75]:

array([[22.0, 7.0, nan, 1],
       [38.0, 71.0, 'C', 1],
       [26.0, 7.0, nan, 0],
       [35.0, 53.0, 'C', 1],
       [35.0, 8.0, nan, 0],
       [nan, 8.0, nan, 0],
       [54.0, 51.0, 'E', 0],
       [2.0, 21.0, nan, 4],
       [27.0, 11.0, nan, 2],
       [14.0, 30.0, nan, 1]], dtype=object)

Let's use these custom transformers on our entire dataset!

First, add "Cabin" and "SibSp" to the list of columns:

In [76]:

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age', 'Cabin', 'SibSp']

Update X and X_new to include these columns:

In [77]:

X = df[cols]
X_new = df_new[cols]

Before we can add the custom transformers to our ColumnTransformer, we need to create two new Pipelines.

First, we need to account for the fact that "Age" and "Fare" have missing values:

Thus we create a Pipeline that does imputation before flooring

In [78]:

imp_floor = make_pipeline(imp, get_floor)

Second, there are multiple complications with using get_first_letter on "Cabin", which you can see by examining the value_counts of the first letter:

It contains missing values, so imputation will be required
Letters are strings, so one-hot encoding will be required
Some of the categories are rare (namely G and T), which can cause problems with cross-validation:
- For a rare category, it's possible for all values of that category to show up in the same testing fold during cross-validation, which will cause an error because that category was not learned during the fit step
- Thus, you have to use the handle_unknown='ignore' argument with OneHotEncoder

In [79]:

X['Cabin'].str.slice(0, 1).value_counts(dropna=False)

Out[79]:

NaN    687
C       59
B       47
D       33
E       32
A       15
F       13
G        4
T        1
Name: Cabin, dtype: int64

Create a Pipeline to get the first letter, then impute a constant value, then one-hot encode the results:

If we had done imputation as the first step (instead of the second step), the "m" of "missing" would have been extracted by get_first_letter

In [80]:

ohe_ignore = OneHotEncoder(handle_unknown='ignore')
letter_imp_ohe = make_pipeline(get_first_letter, imp_constant, ohe_ignore)

Add the three custom transformations to our primary ColumnTransformer:

Use imp_floor instead of imp on "Age" and "Fare"
Use letter_imp_ohe on "Cabin"
Use get_sum on "SibSp" and "Parch"
Change remainder to "drop" since all columns are now specified in the ColumnTransformer

In [81]:

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_floor, ['Age', 'Fare']),
    (letter_imp_ohe, ['Cabin']),
    (get_sum, ['SibSp', 'Parch']),
    remainder='drop')

Update the Pipeline, fit on X, and make predictions for X_new:

In [82]:

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

Out[82]:

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

Cross-validate the Pipeline:

The accuracy has improved from our baseline, and could be further increased through hyperparameter tuning

In [83]:

cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

Out[83]:

0.8271420500910175

Even though it's more work to do these custom transformations in scikit-learn, there are some considerable benefits:

Since all of the transformations are in a Pipeline, they can all be tuned using a grid search
It won't take any extra work to apply these same transformations to new data
There's no possibility of data leakage

JV: How can I keep up-to-date with new features that are released in scikit-learn?¶

Read the what's new page for all major releases:

Scan the entire page looking for features and enhancements that interest you
- Read the class documentation if you want to know more about a new feature
- Read the GitHub pull request if you want to know the detailed history of a feature
Alternatively, you could just read the Release Highlights, but you will miss out on a lot

Otherwise, just watch out for tutorials that teach new scikit-learn features:

Feel free to email me if you have suggestions for blogs that I should follow!

My top pick is An Introduction to Statistical Learning:
- It does an excellent job explaining Machine Learning theory
- It's available as a free PDF
- It uses R, but you can read the book without knowing any R
Feel free to email me if you have read a similar book that uses scikit-learn instead!