# Course: Building an Effective ML Workflow with scikit-learn

### Outline:

• Review of the basic Machine Learning workflow
• Encoding categorical data
• Using ColumnTransformer and Pipeline
• Recap
• Encoding text data

## Part 1: Review of the basic Machine Learning workflow

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.22.1'

Load 10 rows from the famous Titanic dataset:

In [2]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)


Basic terminology:

• "Survived" is the target column
• Target is categorical, thus it's a classification problem
• All other columns are possible features
• Each row is an observation, and represents a passenger
• This is our training data because we know the target values
In [3]:
df

Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

We want to use "Parch" and "Fare" as initial features:

• "Parch" is the number of parents or children aboard with each passenger
• "Fare" is the amount they paid
• Both are numeric

Define X and y:

• X is the feature matrix
• y is the target
In [4]:
X = df[['Parch', 'Fare']]
X

Out[4]:
Parch Fare
0 0 7.2500
1 0 71.2833
2 0 7.9250
3 0 53.1000
4 0 8.0500
5 0 8.4583
6 0 51.8625
7 1 21.0750
8 2 11.1333
9 0 30.0708
In [5]:
y = df['Survived']
y

Out[5]:
0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64

Check the object shapes:

• X is a pandas DataFrame with 2 columns, thus it has 2 dimensions
• y is a pandas Series, thus it has 1 dimension
In [6]:
X.shape

Out[6]:
(10, 2)
In [7]:
y.shape

Out[7]:
(10,)

Create a model object:

• Set the "solver" to increase the likelihood that we will all get the same results
• Set the "random_state" for reproducibility
In [8]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=1)


Evaluate the model using cross-validation:

• Our goal is to simulate model performance on future data so that we can choose between models
• Evaluation metric is classification accuracy
• "cross_val_score" does the dataset splitting, training, predictions, and evaluation
• We can't take these results seriously because the dataset is tiny
In [9]:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()

Out[9]:
0.6944444444444443
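Under the hood, "cross_val_score" splits the rows into folds, fits a fresh copy of the model on the training folds, and scores it on the held-out fold. Here is a minimal sketch of that loop, using synthetic data (not the Titanic rows) so that it stands alone:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for X and y, so this sketch runs without the Titanic data
rng = np.random.RandomState(1)
X_demo = rng.rand(30, 2)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1).astype(int)

scores = []
# With a classifier and cv=3, cross_val_score uses stratified 3-fold splitting
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_demo, y_demo):
    model = LogisticRegression(solver='liblinear', random_state=1)
    model.fit(X_demo[train_idx], y_demo[train_idx])  # train on two folds
    # score on the held-out fold (accuracy, the default for classifiers)
    scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

mean_accuracy = float(np.mean(scores))
```

The mean of the three fold accuracies is what "cross_val_score(...).mean()" returns.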

Train the model on the entire dataset:

In [10]:
logreg.fit(X, y)

Out[10]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=1, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)

Read in a new dataset for which we don't know the target values:

In [11]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
df_new

Out[11]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
5 897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 NaN S
6 898 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q
7 899 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29.0000 NaN S
8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN C
9 901 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.1500 NaN S

Define X_new to have the same columns as X:

In [12]:
X_new = df_new[['Parch', 'Fare']]
X_new

Out[12]:
Parch Fare
0 0 7.8292
1 0 7.0000
2 0 9.6875
3 0 8.6625
4 1 12.2875
5 0 9.2250
6 0 7.6292
7 1 29.0000
8 0 7.2292
9 0 24.1500

Use the trained model to make predictions for X_new:

In [13]:
logreg.predict(X_new)

Out[13]:
array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
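"predict" returns the class with the higher predicted probability; you can inspect the probabilities themselves with "predict_proba". A quick sketch on synthetic data (not the Titanic rows):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a simple learnable rule
rng = np.random.RandomState(1)
X_demo = rng.rand(20, 2)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

model = LogisticRegression(solver='liblinear', random_state=1).fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)   # shape (20, 2): P(class 0), P(class 1)
labels = model.predict(X_demo)        # class with the higher probability

# For binary classification, predict is equivalent to thresholding P(class 1) at 0.5
manual = (proba[:, 1] > 0.5).astype(int)
```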

## Part 2: Encoding categorical data

We want to use "Embarked" and "Sex" as additional features:

• "Embarked" is the port they embarked from
• "Sex" is male or female
• They are unordered categorical features
• They can't be directly passed to the model because they aren't numeric
In [14]:
df

Out[14]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Encode "Embarked" using one-hot encoding:

• This is the same as "dummy encoding"
• Outputs a sparse matrix, which saves memory by storing only the non-zero values (efficient when most values in a matrix are zeros)
• Use two brackets around "Embarked" to pass a DataFrame instead of a Series, because OneHotEncoder expects 2-dimensional input
In [15]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(df[['Embarked']])

Out[15]:
<10x3 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>

Ask for a dense (not sparse) matrix so that we can examine the encoding:

• There are 3 columns because there were 3 unique values in "Embarked"
• Each row contains a single 1
• [1 0 0] means "C", [0 1 0] means "Q", [0 0 1] means "S"
• The categories are listed in alphabetical order in the "categories_" attribute
• You can think of "categories_" as the column headings for the matrix
• From each of the three features, the model can learn the relationship between the target value and whether or not a given passenger embarked at that port
In [16]:
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['Embarked']])

Out[16]:
array([[0., 0., 1.],
[1., 0., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[1., 0., 0.]])
In [17]:
ohe.categories_

Out[17]:
[array(['C', 'Q', 'S'], dtype=object)]

What's the difference between "fit" and "transform"?

• OneHotEncoder is a "transformer", meaning its role is data transformations
• Transformers usually have a "fit" method and always have a "transform" method
• For all transformers: "fit" is when they learn something, and "transform" is when they use what they learned to do the transformation
• For OneHotEncoder: "fit" is when it learns the categories, and "transform" is when it creates the matrix using those categories
• If you are going to "fit" and "transform", then you should do it in a single step using "fit_transform"
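To build intuition for the fit/transform pattern, here is a toy transformer (a hypothetical, stripped-down stand-in, not scikit-learn's actual implementation): "fit" learns the sorted categories, and "transform" uses them to build the matrix:

```python
class ToyOneHotEncoder:
    """Toy illustration of the fit/transform pattern (not scikit-learn's API)."""

    def fit(self, values):
        # "fit" is the learning step: discover the categories, sorted alphabetically
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # "transform" applies what was learned: one row per value, one column per category
        return [[1.0 if v == cat else 0.0 for cat in self.categories_]
                for v in values]

    def fit_transform(self, values):
        # convenience method: learn and transform in a single step
        return self.fit(values).transform(values)

encoder = ToyOneHotEncoder()
matrix = encoder.fit_transform(['S', 'C', 'S', 'Q'])
# encoder.categories_ is ['C', 'Q', 'S'], so 'S' encodes as [0.0, 0.0, 1.0]
```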

Encode "Embarked" and "Sex" at the same time:

• The first 3 columns represent "Embarked" and the last 2 columns represent "Sex"
• For the "Sex" columns: [1 0] means "female", [0 1] means "male"
In [18]:
ohe.fit_transform(df[['Embarked', 'Sex']])

Out[18]:
array([[0., 0., 1., 0., 1.],
[1., 0., 0., 1., 0.],
[0., 0., 1., 1., 0.],
[0., 0., 1., 1., 0.],
[0., 0., 1., 0., 1.],
[0., 1., 0., 0., 1.],
[0., 0., 1., 0., 1.],
[0., 0., 1., 0., 1.],
[0., 0., 1., 1., 0.],
[1., 0., 0., 1., 0.]])
In [19]:
ohe.categories_

Out[19]:
[array(['C', 'Q', 'S'], dtype=object), array(['female', 'male'], dtype=object)]

How could we include "Embarked" and "Sex" in the model along with "Parch" and "Fare"?

• Stack the 2 numeric features side-by-side with the 5 encoded columns, and then train the model with all 7 columns
• However, we would need to repeat the same process (encoding and stacking) with the new data before making predictions
• Doing this manually is inefficient and error-prone, and the complexity will only increase as you preprocess additional columns
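For reference, the manual approach would look something like this sketch (illustrative names and tiny inline data, not the real Titanic rows):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in for the Titanic columns
demo = pd.DataFrame({'Parch': [0, 0, 1],
                     'Fare': [7.25, 71.28, 21.08],
                     'Embarked': ['S', 'C', 'S'],
                     'Sex': ['male', 'female', 'male']})

ohe = OneHotEncoder()
# densify with toarray() so we can stack; 4 encoded columns for this tiny data
encoded = ohe.fit_transform(demo[['Embarked', 'Sex']]).toarray()
numeric = demo[['Parch', 'Fare']].to_numpy()

# Stack the numeric and encoded columns side-by-side before training
X_stacked = np.hstack([numeric, encoded])

# At prediction time, you must repeat the SAME steps on the new data
# (using transform, not fit_transform) before calling predict
```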

## Part 3: Using ColumnTransformer and Pipeline

Goals:

• Use ColumnTransformer to make it easy to apply different preprocessing to different columns
• Use Pipeline to make it easy to apply the same workflow to training data and new data

Create a list of columns and use that to update X:

In [20]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

In [21]:
X = df[cols]
X

Out[21]:
Parch Fare Embarked Sex
0 0 7.2500 S male
1 0 71.2833 C female
2 0 7.9250 S female
3 0 53.1000 S female
4 0 8.0500 S male
5 0 8.4583 Q male
6 0 51.8625 S male
7 1 21.0750 S male
8 2 11.1333 S female
9 0 30.0708 C female

Create an instance of OneHotEncoder with the default options:

In [22]:
ohe = OneHotEncoder()


Create a ColumnTransformer:

• First argument (a tuple) specifies that we want to one-hot encode the "Embarked" and "Sex" columns
• "remainder" argument specifies that we want to keep all other columns in the final output (without modifying them)
In [23]:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')


Perform the transformation:

• Output contains 7 columns in this order: 3 columns for "Embarked", 2 for "Sex", 1 for "Parch", and 1 for "Fare"
• Column order is the order in which you listed the columns in the ColumnTransformer, followed by any passthrough columns
In [24]:
ct.fit_transform(X)

Out[24]:
array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
[ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
[ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
[ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
[ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
[ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
[ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])

Use Pipeline to chain together sequential steps:

• Step 1 is data preprocessing using ColumnTransformer
• Step 2 is model building using LogisticRegression
In [25]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ct, logreg)


Fit the Pipeline:

• Step 1: X gets transformed from 4 columns to 7 columns by ColumnTransformer
• Step 2: LogisticRegression model gets fit, thus it learns the relationship between those 7 features and the y values
• Step 1 is assigned the name "columntransformer" (all lowercase), and step 2 is assigned the name "logisticregression"
In [26]:
pipe.fit(X, y)

Out[26]:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='passthrough',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('onehotencoder',
OneHotEncoder(categories='auto',
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='error',
sparse=True),
['Embarked', 'Sex'])],
verbose=False)),
('logisticregression',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None,
penalty='l2', random_state=1,
solver='liblinear', tol=0.0001, verbose=0,
warm_start=False))],
verbose=False)

This is what happens "under the hood" when you fit the Pipeline:

In [27]:
logreg.fit(ct.fit_transform(X), y)

Out[27]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=1, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)

You can select the steps of a Pipeline by name in order to inspect them:

• These are the 7 coefficients of the logistic regression model
In [28]:
pipe.named_steps.logisticregression.coef_

Out[28]:
array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
0.20056557,  0.01597307]])
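Those coefficients combine with the 7 transformed feature values to produce a prediction: the model computes intercept + dot(coef, x) and passes the result through the logistic (sigmoid) function to get a probability. A self-contained sketch of that arithmetic, using hypothetical coefficient and intercept values (not the fitted ones above):

```python
import math

# Hypothetical coefficients and intercept for a 7-feature logistic regression
coef = [0.26, -0.20, -0.23, 1.01, -1.17, 0.20, 0.016]
intercept = 0.5

def predict_proba_one(x):
    """Probability of the positive class for one transformed row."""
    z = intercept + sum(c * xi for c, xi in zip(coef, x))
    return 1 / (1 + math.exp(-z))  # logistic (sigmoid) function

# Example row in the transformed column order: [C, Q, S, female, male, Parch, Fare]
# (a female passenger who embarked at C, with Parch=0 and Fare=30.07)
row = [1, 0, 0, 1, 0, 0, 30.07]
p = predict_proba_one(row)
survived = int(p > 0.5)
```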

Update X_new to have the same columns as X:

In [29]:
X_new = df_new[cols]
X_new

Out[29]:
Parch Fare Embarked Sex
0 0 7.8292 Q male
1 0 7.0000 S female
2 0 9.6875 Q male
3 0 8.6625 S male
4 1 12.2875 S female
5 0 9.2250 S male
6 0 7.6292 Q female
7 1 29.0000 S male
8 0 7.2292 C female
9 0 24.1500 S male

Use the fitted Pipeline to make predictions for X_new:

In [30]:
pipe.predict(X_new)

Out[30]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

This is what happens "under the hood" when you make predictions using the Pipeline:

• It uses "transform" rather than "fit_transform" so that the exact encoding scheme learned from the training data (during the "fit" step) will be applied to the new data
In [31]:
logreg.predict(ct.transform(X_new))

Out[31]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
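The distinction matters: if you called "fit_transform" on the new data, the encoder would re-learn the categories from X_new and could produce a matrix with different columns. A toy sketch (pure Python, with a hypothetical helper function) of why the training-time categories must be reused:

```python
def one_hot(values, categories):
    """Encode values against a FIXED category list (what transform does)."""
    return [[1 if v == cat else 0 for cat in categories] for v in values]

train = ['S', 'C', 'S', 'Q']
new = ['S', 'S']                      # the new data happens to contain only 'S'

categories = sorted(set(train))       # learned during fit: ['C', 'Q', 'S']

# transform: reuse the learned categories, so new data still gets 3 columns
good = one_hot(new, categories)               # [[0, 0, 1], [0, 0, 1]]

# fit_transform on the new data would re-learn categories from it (only ['S']),
# producing a 1-column matrix that the trained model cannot accept
bad = one_hot(new, sorted(set(new)))          # [[1], [1]]
```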

## Recap

This is all of the code that is necessary to recreate our workflow up to this point:

• You can copy/paste this code from http://bit.ly/basic-pipeline
• There are no calls to "fit_transform" or "transform" because all of that functionality is encapsulated by the Pipeline
In [32]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [33]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

In [34]:
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']

In [35]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]

In [36]:
ohe = OneHotEncoder()

In [37]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')

In [38]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

In [39]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

Out[39]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

Summary of our ColumnTransformer:

• It selected 2 categorical columns and transformed them, resulting in 5 columns
• It selected 2 numerical columns and did nothing to them, resulting in 2 columns
• It stacked the 7 columns side-by-side

Summary of our Pipeline:

• Step 1 transformed the data from 4 columns to 7 columns using ColumnTransformer
• Step 2 used a LogisticRegression model for fitting and predicting

Comparing Pipeline and ColumnTransformer:

• ColumnTransformer pulls out subsets of columns and transforms them, and then stacks the results side-by-side
• Pipeline is a series of steps that occur in order, and the output of each step passes to the next step
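The Pipeline's data flow can be sketched as simple function chaining, where each step's output becomes the next step's input (a toy illustration, not scikit-learn's implementation):

```python
def run_pipeline(steps, data):
    # pass the data through each step in order
    for step in steps:
        data = step(data)
    return data

# Hypothetical "steps": a preprocessor followed by a model
preprocess = lambda rows: [[len(r)] for r in rows]             # rows -> features
predict = lambda feats: [1 if f[0] > 3 else 0 for f in feats]  # features -> labels

result = run_pipeline([preprocess, predict], ['abcd', 'ab'])
# -> [1, 0]
```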

## Part 4: Encoding text data

We want to use "Name" as an additional feature:

• It can't be directly passed to the model because it isn't numeric
In [40]:
df

Out[40]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Use CountVectorizer to convert text into a matrix of token counts:

• Use single brackets around "Name" to pass a Series, because CountVectorizer expects 1-dimensional input
• Outputs a document-term matrix containing 10 rows (one for each name) and 40 columns (one for each unique word)
In [41]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
dtm = vect.fit_transform(df['Name'])
dtm

Out[41]:
<10x40 sparse matrix of type '<class 'numpy.int64'>'
with 46 stored elements in Compressed Sparse Row format>

Examine the feature names:

• It found 40 unique words in the "Name" Series after lowercasing the words, removing punctuation, and removing words that were only 1 character long
In [42]:
print(vect.get_feature_names())

['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']


Examine the document-term matrix as a DataFrame:

• In each row, CountVectorizer counted how many times each word appeared
• For example, the first row contains 36 zeros and 4 ones (under "braund", "mr", "owen", and "harris")
• This encoding is known as the "Bag of Words" representation
• From each of the 40 features, the model can learn the relationship between the target value and how many times that word appeared in each passenger's name
In [43]:
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

Out[43]:
achem adele allen berg bradley braund briggs cumings elisabeth florence ... nasser nicholas oscar owen palsson peel thayer timothy vilhelmina william
0 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
1 0 0 0 0 1 0 1 1 0 1 ... 0 0 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 1 0
9 1 1 0 0 0 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 0

10 rows × 40 columns
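CountVectorizer's default behavior can be mimicked in a few lines: lowercase the text, extract words of 2 or more characters (its default token_pattern), and count occurrences. A minimal sketch on two of the names:

```python
import re

def tokenize(text):
    # Mimics CountVectorizer's defaults: lowercase, keep words of 2+ characters
    return re.findall(r'\b\w\w+\b', text.lower())

docs = ['Braund, Mr. Owen Harris', 'Moran, Mr. James']

# "fit": learn the vocabulary (unique words, sorted alphabetically)
vocabulary = sorted(set(word for doc in docs for word in tokenize(doc)))

# "transform": one row per document, one column per vocabulary word
dtm = [[tokenize(doc).count(word) for word in vocabulary] for doc in docs]
```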

Update X to include the "Name" column:

In [44]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]
X

Out[44]:
Parch Fare Embarked Sex Name
0 0 7.2500 S male Braund, Mr. Owen Harris
1 0 71.2833 C female Cumings, Mrs. John Bradley (Florence Briggs Th...
2 0 7.9250 S female Heikkinen, Miss. Laina
3 0 53.1000 S female Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 0 8.0500 S male Allen, Mr. William Henry
5 0 8.4583 Q male Moran, Mr. James
6 0 51.8625 S male McCarthy, Mr. Timothy J
7 1 21.0750 S male Palsson, Master. Gosta Leonard
8 2 11.1333 S female Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 0 30.0708 C female Nasser, Mrs. Nicholas (Adele Achem)

Update the ColumnTransformer:

• Add another tuple to specify that CountVectorizer should be applied to the "Name" column
• There are no brackets around "Name" because CountVectorizer expects 1-dimensional input
In [45]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')


Perform the transformation:

• Output contains 47 columns in this order: 3 columns for "Embarked", 2 for "Sex", 40 for "Name", 1 for "Parch", and 1 for "Fare"
• Output is a sparse matrix this time because its overall density (78 of 470 values are non-zero) is below ColumnTransformer's default "sparse_threshold" of 0.3
In [46]:
ct.fit_transform(X)

Out[46]:
<10x47 sparse matrix of type '<class 'numpy.float64'>'
with 78 stored elements in Compressed Sparse Row format>

Update the Pipeline to contain the modified ColumnTransformer:

In [47]:
pipe = make_pipeline(ct, logreg)


Fit the Pipeline and examine the steps:

In [48]:
pipe.fit(X, y)
pipe.named_steps

Out[48]:
{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
transformer_weights=None,
transformers=[('onehotencoder',
OneHotEncoder(categories='auto', drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='error',
sparse=True),
['Embarked', 'Sex']),
('countvectorizer',
CountVectorizer(analyzer='word', binary=False,
decode_error='strict',
dtype=<class 'numpy.int64'>,
encoding='utf-8',
input='content',
lowercase=True, max_df=1.0,
max_features=None, min_df=1,
ngram_range=(1, 1),
preprocessor=None,
stop_words=None,
strip_accents=None,
token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None,
vocabulary=None),
'Name')],
verbose=False),
'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=1, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)}

Update X_new to include the "Name" column:

In [49]:
X_new = df_new[cols]


Use the fitted Pipeline to make predictions for X_new:

In [50]:
pipe.predict(X_new)

Out[50]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])