Course: Building an Effective ML Workflow with scikit-learn

Outline:

  • Review of the basic Machine Learning workflow
  • Encoding categorical data
  • Using ColumnTransformer and Pipeline
  • Recap
  • Encoding text data

Part 1: Review of the basic Machine Learning workflow

Check your scikit-learn version:

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.22.1'

Load 10 rows from the famous Titanic dataset:

In [2]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)

Basic terminology:

  • "Survived" is the target column
  • Target is categorical, thus it's a classification problem
  • All other columns are possible features
  • Each row is an observation, and represents a passenger
  • This is our training data because we know the target values
In [3]:
df
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

We want to use "Parch" and "Fare" as initial features:

  • "Parch" is the number of parents or children aboard with each passenger
  • "Fare" is the amount they paid
  • Both are numeric

Define X and y:

  • X is the feature matrix
  • y is the target
In [4]:
X = df[['Parch', 'Fare']]
X
Out[4]:
Parch Fare
0 0 7.2500
1 0 71.2833
2 0 7.9250
3 0 53.1000
4 0 8.0500
5 0 8.4583
6 0 51.8625
7 1 21.0750
8 2 11.1333
9 0 30.0708
In [5]:
y = df['Survived']
y
Out[5]:
0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64

Check the object shapes:

  • X is a pandas DataFrame with 2 columns, thus it has 2 dimensions
  • y is a pandas Series, thus it has 1 dimension
In [6]:
X.shape
Out[6]:
(10, 2)
In [7]:
y.shape
Out[7]:
(10,)

Create a model object:

  • Set the "solver" to increase the likelihood that we will all get the same results
  • Set the "random_state" for reproducibility
In [8]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=1)

Evaluate the model using cross-validation:

  • Our goal is to simulate model performance on future data so that we can choose between models
  • Evaluation metric is classification accuracy
  • "cross_val_score" does the dataset splitting, training, predictions, and evaluation
  • Your results may differ based on your scikit-learn version
  • We can't take these results seriously because the dataset is tiny
In [9]:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()
Out[9]:
0.6944444444444443

Train the model on the entire dataset:

In [10]:
logreg.fit(X, y)
Out[10]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Read in a new dataset for which we don't know the target values:

In [11]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
df_new
Out[11]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
5 897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 NaN S
6 898 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q
7 899 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29.0000 NaN S
8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN C
9 901 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.1500 NaN S

Define X_new to have the same columns as X:

In [12]:
X_new = df_new[['Parch', 'Fare']]
X_new
Out[12]:
Parch Fare
0 0 7.8292
1 0 7.0000
2 0 9.6875
3 0 8.6625
4 1 12.2875
5 0 9.2250
6 0 7.6292
7 1 29.0000
8 0 7.2292
9 0 24.1500

Use the trained model to make predictions for X_new:

In [13]:
logreg.predict(X_new)
Out[13]:
array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
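
The model can also output predicted probabilities rather than just class labels. This isn't shown in the workflow above, but it uses the same fitted model and the standard LogisticRegression API:

logreg.predict_proba(X_new)

Each row contains 2 columns: the predicted probability of class 0 (did not survive) and class 1 (survived).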

Part 2: Encoding categorical data

We want to use "Embarked" and "Sex" as additional features:

  • "Embarked" is the port they embarked from
  • "Sex" is male or female
  • They are unordered categorical features
  • They can't be directly passed to the model because they aren't numeric
In [14]:
df
Out[14]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Encode "Embarked" using one-hot encoding:

  • This is the same as "dummy encoding"
  • Outputs a sparse matrix, which is a more memory-efficient representation when most values in a matrix are zeros
  • Use two brackets around "Embarked" to pass a DataFrame instead of a Series
In [15]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(df[['Embarked']])
Out[15]:
<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

Ask for a dense (not sparse) matrix so that we can examine the encoding:

  • There are 3 columns because there were 3 unique values in "Embarked"
  • Each row contains a single 1
  • 100 means "C", 010 means "Q", 001 means "S"
  • The categories are listed in alphabetical order in the "categories_" attribute
  • You can think of "categories_" as the column headings for the matrix
  • From each of the three features, the model can learn the relationship between the target value and whether or not a given passenger embarked at that port
In [16]:
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['Embarked']])
Out[16]:
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])
In [17]:
ohe.categories_
Out[17]:
[array(['C', 'Q', 'S'], dtype=object)]

What's the difference between "fit" and "transform"?

  • OneHotEncoder is a "transformer", meaning its role is to transform data
  • Transformers have a "fit" method and a "transform" method
  • For all transformers: "fit" is when they learn something, and "transform" is when they use what they learned to do the transformation
  • For OneHotEncoder: "fit" is when it learns the categories, and "transform" is when it creates the matrix using those categories
  • If you are going to "fit" and then "transform" the same data, you should do it in a single step using "fit_transform" (see the sketch below)
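
As a minimal sketch of that division of labor, using the encoder and DataFrame from above:

ohe = OneHotEncoder(sparse=False)

# "fit" learns the categories from the data
ohe.fit(df[['Embarked']])

# "transform" creates the matrix using the learned categories
ohe.transform(df[['Embarked']])

# "fit_transform" combines both steps in a single call
ohe.fit_transform(df[['Embarked']])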

Encode "Embarked" and "Sex" at the same time:

  • First 3 columns represent "Embarked" and last 2 columns represent "Sex"
  • For the "Sex" columns: 10 means "female", 01 means "male"
In [18]:
ohe.fit_transform(df[['Embarked', 'Sex']])
Out[18]:
array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.]])
In [19]:
ohe.categories_
Out[19]:
[array(['C', 'Q', 'S'], dtype=object), array(['female', 'male'], dtype=object)]

How could we include "Embarked" and "Sex" in the model along with "Parch" and "Fare"?

  • Stack the 2 numeric features side-by-side with the 5 encoded columns, and then train the model with all 7 columns
  • However, we would need to repeat the same process (encoding and stacking) with the new data before making predictions
  • Doing this manually is inefficient and error-prone, and the complexity will only increase as you preprocess additional columns (see the sketch below)
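
For illustration only, here is a minimal sketch of that manual process. The variable names are mine, and numpy is an extra import; this is exactly the bookkeeping that ColumnTransformer and Pipeline will handle for us in Part 3:

import numpy as np

ohe = OneHotEncoder(sparse=False)

# encode the categorical columns, stack them with the numeric columns, and train
X_stacked = np.column_stack((ohe.fit_transform(df[['Embarked', 'Sex']]),
                             df[['Parch', 'Fare']]))
logreg.fit(X_stacked, y)

# repeat the same encoding and stacking for the new data before predicting,
# remembering to use "transform" (not "fit_transform") on the new data
X_new_stacked = np.column_stack((ohe.transform(df_new[['Embarked', 'Sex']]),
                                 df_new[['Parch', 'Fare']]))
logreg.predict(X_new_stacked)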

Part 3: Using ColumnTransformer and Pipeline

Goals:

  • Use ColumnTransformer to make it easy to apply different preprocessing to different columns
  • Use Pipeline to make it easy to apply the same workflow to training data and new data

Create a list of columns and use that to update X:

In [20]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']
In [21]:
X = df[cols]
X
Out[21]:
Parch Fare Embarked Sex
0 0 7.2500 S male
1 0 71.2833 C female
2 0 7.9250 S female
3 0 53.1000 S female
4 0 8.0500 S male
5 0 8.4583 Q male
6 0 51.8625 S male
7 1 21.0750 S male
8 2 11.1333 S female
9 0 30.0708 C female

Create an instance of OneHotEncoder with the default options:

In [22]:
ohe = OneHotEncoder()

Create a ColumnTransformer:

  • First argument (a tuple) specifies that we want to one-hot encode the "Embarked" and "Sex" columns
  • "remainder" argument specifies that we want to keep all other columns in the final output (without modifying them)
In [23]:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')

Perform the transformation:

  • Output contains 7 columns in this order: 3 columns for "Embarked", 2 for "Sex", 1 for "Parch", and 1 for "Fare"
  • Columns appear in the order you listed them in the ColumnTransformer, followed by any passed-through columns
In [24]:
ct.fit_transform(X)
Out[24]:
array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])

Use Pipeline to chain together sequential steps:

  • Step 1 is data preprocessing using ColumnTransformer
  • Step 2 is model building using LogisticRegression
In [25]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ct, logreg)

Fit the Pipeline:

  • Step 1: X gets transformed from 4 columns to 7 columns by ColumnTransformer
  • Step 2: LogisticRegression model gets fit, thus it learns the relationship between those 7 features and the y values
  • Step 1 is assigned the name "columntransformer" (all lowercase), and step 2 is assigned the name "logisticregression"
In [26]:
pipe.fit(X, y)
Out[26]:
Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['Embarked', 'Sex'])],
                                   verbose=False)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=1,
                                    solver='liblinear', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

This is what happens "under the hood" when you fit the Pipeline:

In [27]:
logreg.fit(ct.fit_transform(X), y)
Out[27]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

You can select the steps of a Pipeline by name in order to inspect them:

  • These are the 7 coefficients of the logistic regression model
In [28]:
pipe.named_steps.logisticregression.coef_
Out[28]:
array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
         0.20056557,  0.01597307]])
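
Since we know the column order of the transformed matrix (3 columns for "Embarked", 2 for "Sex", then "Parch" and "Fare"), we can pair each coefficient with a label. The labels below are written out by hand for illustration, not generated by scikit-learn:

feature_names = ['Embarked_C', 'Embarked_Q', 'Embarked_S',
                 'Sex_female', 'Sex_male', 'Parch', 'Fare']
pd.Series(pipe.named_steps.logisticregression.coef_[0], index=feature_names)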

Update X_new to have the same columns as X:

In [29]:
X_new = df_new[cols]
X_new
Out[29]:
Parch Fare Embarked Sex
0 0 7.8292 Q male
1 0 7.0000 S female
2 0 9.6875 Q male
3 0 8.6625 S male
4 1 12.2875 S female
5 0 9.2250 S male
6 0 7.6292 Q female
7 1 29.0000 S male
8 0 7.2292 C female
9 0 24.1500 S male

Use the fitted Pipeline to make predictions for X_new:

In [30]:
pipe.predict(X_new)
Out[30]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

This is what happens "under the hood" when you make predictions using the Pipeline:

  • It uses "transform" rather than "fit_transform" so that the exact encoding scheme learned from the training data (during the "fit" step) will be applied to the new data
In [31]:
logreg.predict(ct.transform(X_new))
Out[31]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

Recap

This is all of the code that is necessary to recreate our workflow up to this point:

  • You can copy/paste this code from http://bit.ly/basic-pipeline
  • There are no calls to "fit_transform" or "transform" because all of that functionality is encapsulated by the Pipeline
In [32]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
In [33]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']
In [34]:
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']
In [35]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]
In [36]:
ohe = OneHotEncoder()
In [37]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')
In [38]:
logreg = LogisticRegression(solver='liblinear', random_state=1)
In [39]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
Out[39]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

Summary of our ColumnTransformer:

  • It selected 2 categorical columns and transformed them, resulting in 5 columns
  • It selected 2 numerical columns and did nothing to them, resulting in 2 columns
  • It stacked the 7 columns side-by-side (you can verify the column names as shown below)
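
If you upgrade to scikit-learn 1.0 or later, the fitted ColumnTransformer can report the names of its output columns, which is a handy way to verify this summary. Note that this method does not exist in the 0.22 release used above:

# requires scikit-learn 1.0 or later
ct.get_feature_names_out()

The encoded columns come back with names like 'onehotencoder__Embarked_C', and the passed-through columns with names like 'remainder__Parch'.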

Summary of our Pipeline:

  • Step 1 transformed the data from 4 columns to 7 columns using ColumnTransformer
  • Step 2 used a LogisticRegression model for fitting and predicting

Comparing Pipeline and ColumnTransformer:

  • ColumnTransformer pulls out subsets of columns and transforms them, and then stacks the results side-by-side
  • Pipeline is a series of steps that occur in order, and the output of each step passes to the next step (an equivalent explicit construction is sketched below)
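
"make_column_transformer" and "make_pipeline" are convenience functions that assign step names automatically (such as "columntransformer" and "logisticregression" above). Here is a sketch of the equivalent explicit construction using the ColumnTransformer and Pipeline classes, where the step names ("encoder", "preprocessor", "classifier") are my own choices:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['Embarked', 'Sex'])],
    remainder='passthrough')
pipe = Pipeline(steps=[('preprocessor', ct), ('classifier', logreg)])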

Part 4: Encoding text data

We want to use "Name" as an additional feature:

  • It can't be directly passed to the model because it isn't numeric
In [40]:
df
Out[40]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Use CountVectorizer to convert text into a matrix of token counts:

  • Use single brackets around "Name" to pass a Series, because CountVectorizer expects 1-dimensional input
  • Outputs a document-term matrix containing 10 rows (one for each name) and 40 columns (one for each unique word)
In [41]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
dtm = vect.fit_transform(df['Name'])
dtm
Out[41]:
<10x40 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>

Examine the feature names:

  • It found 40 unique words in the "Name" Series after lowercasing the words, removing punctuation, and removing words that were only 1 character long
In [42]:
print(vect.get_feature_names())
['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']
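
You can inspect this tokenization directly: "build_analyzer" returns the function that CountVectorizer uses to split each string into words. A quick check on one of the names (my own example, not part of the course):

analyzer = vect.build_analyzer()
analyzer('McCarthy, Mr. Timothy J')

This returns ['mccarthy', 'mr', 'timothy']: the text was lowercased, the punctuation was dropped, and the 1-character 'J' was discarded by the default token pattern.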

Examine the document-term matrix as a DataFrame:

  • In each row, CountVectorizer counted how many times each word appeared
  • For example, the first row contains 36 zeros and 4 ones (under "braund", "mr", "owen", and "harris")
  • This encoding is known as the "Bag of Words" representation
  • From each of the 40 features, the model can learn the relationship between the target value and how many times that word appeared in each passenger's name
In [43]:
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
Out[43]:
achem adele allen berg bradley braund briggs cumings elisabeth florence ... nasser nicholas oscar owen palsson peel thayer timothy vilhelmina william
0 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
1 0 0 0 0 1 0 1 1 0 1 ... 0 0 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 1 0
9 1 1 0 0 0 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 0

10 rows × 40 columns

Update X to include the "Name" column:

In [44]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]
X
Out[44]:
Parch Fare Embarked Sex Name
0 0 7.2500 S male Braund, Mr. Owen Harris
1 0 71.2833 C female Cumings, Mrs. John Bradley (Florence Briggs Th...
2 0 7.9250 S female Heikkinen, Miss. Laina
3 0 53.1000 S female Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 0 8.0500 S male Allen, Mr. William Henry
5 0 8.4583 Q male Moran, Mr. James
6 0 51.8625 S male McCarthy, Mr. Timothy J
7 1 21.0750 S male Palsson, Master. Gosta Leonard
8 2 11.1333 S female Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 0 30.0708 C female Nasser, Mrs. Nicholas (Adele Achem)

Update the ColumnTransformer:

  • Add another tuple to specify that CountVectorizer should be applied to the "Name" column
  • There are no brackets around "Name" because CountVectorizer expects 1-dimensional input
In [45]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')

Perform the transformation:

  • Output contains 47 columns in this order: 3 columns for "Embarked", 2 for "Sex", 40 for "Name", 1 for "Parch", and 1 for "Fare"
In [46]:
ct.fit_transform(X)
Out[46]:
<10x47 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>
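
Unlike In [24], the output is sparse this time because most of the 47 columns are zeros: ColumnTransformer stacks the results as a sparse matrix when the overall density falls below its "sparse_threshold" parameter (0.3 by default). To examine the values, you can convert the output to a dense array:

ct.fit_transform(X).toarray()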

Update the Pipeline to contain the modified ColumnTransformer:

In [47]:
pipe = make_pipeline(ct, logreg)

Fit the Pipeline and examine the steps:

In [48]:
pipe.fit(X, y)
pipe.named_steps
Out[48]:
{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                   transformer_weights=None,
                   transformers=[('onehotencoder',
                                  OneHotEncoder(categories='auto', drop=None,
                                                dtype=<class 'numpy.float64'>,
                                                handle_unknown='error',
                                                sparse=True),
                                  ['Embarked', 'Sex']),
                                 ('countvectorizer',
                                  CountVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=<class 'numpy.int64'>,
                                                  encoding='utf-8',
                                                  input='content',
                                                  lowercase=True, max_df=1.0,
                                                  max_features=None, min_df=1,
                                                  ngram_range=(1, 1),
                                                  preprocessor=None,
                                                  stop_words=None,
                                                  strip_accents=None,
                                                  token_pattern='(?u)\\b\\w\\w+\\b',
                                                  tokenizer=None,
                                                  vocabulary=None),
                                  'Name')],
                   verbose=False),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                    warm_start=False)}

Update X_new to include the "Name" column:

In [49]:
X_new = df_new[cols]

Use the fitted Pipeline to make predictions for X_new:

In [50]:
pipe.predict(X_new)
Out[50]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])