Office Hours session 1

Recap of Lesson 1 (copy from here: http://bit.ly/first-ml-lesson)

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
In [2]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
In [3]:
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']
In [4]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]
In [5]:
ohe = OneHotEncoder()
vect = CountVectorizer()
In [6]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
In [7]:
logreg = LogisticRegression(solver='liblinear', random_state=1)
In [8]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
Out[8]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

Anna: Why is it important to select a Series instead of a DataFrame for the target variable?

In [9]:
df
Out[9]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Selecting a Series versus a DataFrame:

In [10]:
df['Survived']
Out[10]:
0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64
In [11]:
df[['Survived']]
Out[11]:
Survived
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 1
9 1

Series is 1D, DataFrame is 2D:

In [12]:
df['Survived'].shape
Out[12]:
(10,)
In [13]:
df[['Survived']].shape
Out[13]:
(10, 1)

Use to_numpy() to convert a pandas object to a NumPy array:

In [14]:
df['Survived'].to_numpy()
Out[14]:
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])
In [15]:
df[['Survived']].to_numpy()
Out[15]:
array([[0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1]])
  • 1D target is used for classification problems with a single label
  • 2D target is used for multilabel classification problems

Shashi: Can you explain the solver and random_state parameters of logistic regression?

In [16]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

Solver:

random_state:

  • Use it to ensure reproducibility any time you have a pseudo-random process
  • Set it to any integer

Hause: Is the intercept included in the list of logistic regression coefficients?

No, it's stored separately in the "intercept_" attribute.

VK: Why did you use LogisticRegression instead of LogisticRegressionCV, which has built-in cross-validation?

I prefer not to use super-specialized functions like LogisticRegressionCV, and instead use GridSearchCV which works with any model and integrates well into the scikit-learn workflow.

Arjun: Is there a reason that you didn't do cross-validation after each change that you made to your model?

Normally you would do cross-validation after each change, but in this case cross-validation scores would have been highly misleading due to the dataset size.

Anton: Do you always fit the model on the entire dataset after setting the hyperparameters via cross-validation?

Yes, you fit the model to all samples for which you know the target value, otherwise you are throwing away useful training data.

Rachel: How do I choose the right "scoring" parameter?

Arjun: When one-hot encoding, what happens if the testing data has a new category that was not in the training data?

Use fit_transform on training data:

In [17]:
demo_train = pd.DataFrame({'letter':['A', 'B', 'C', 'B']})
demo_train
Out[17]:
letter
0 A
1 B
2 C
3 B
In [18]:
ohe = OneHotEncoder(sparse=False)
In [19]:
ohe.fit_transform(demo_train[['letter']])
Out[19]:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

If you use fit_transform on testing data, it won't learn the same categories, which will be problematic:

In [20]:
demo_test = pd.DataFrame({'letter':['A', 'C', 'A']})
demo_test
Out[20]:
letter
0 A
1 C
2 A
In [21]:
ohe.fit_transform(demo_test[['letter']])
Out[21]:
array([[1., 0.],
       [0., 1.],
       [1., 0.]])

Always use fit_transform on training data and transform (only) on testing data so that categories will be represented the same way:

In [22]:
ohe.fit_transform(demo_train[['letter']])
Out[22]:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])
In [23]:
ohe.transform(demo_test[['letter']])
Out[23]:
array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

If testing data contains a new category, the encoder will error during transformation:

In [24]:
demo_test_unknown = pd.DataFrame({'letter':['A', 'C', 'D']})
demo_test_unknown
Out[24]:
letter
0 A
1 C
2 D
In [25]:
# ohe.transform(demo_test_unknown[['letter']])

The solution is to tell the encoder to represent unknown categories as all zeros:

In [26]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
In [27]:
ohe.fit_transform(demo_train[['letter']])
Out[27]:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])
In [28]:
ohe.transform(demo_test_unknown[['letter']])
Out[28]:
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])

Advice:

  • By default, keep handle_unknown='error' so you know if you are encountering new categories
  • If you encounter new categories, set handle_unknown='ignore' but keep in mind that all unknown categories will be encoded the same way
  • As soon as possible, retrain your model with data that includes any new categories

Chris: Should we drop one of the one-hot encoded features, since some models (such as linear regression) don't like when there is collinearity between features?

Here is the default one-hot encoding:

In [29]:
demo_train
Out[29]:
letter
0 A
1 B
2 C
3 B
In [30]:
ohe.fit_transform(demo_train[['letter']])
Out[30]:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

You can also drop the first level (new in version 0.21):

In [31]:
ohe = OneHotEncoder(sparse=False, drop='first')
In [32]:
ohe.fit_transform(demo_train[['letter']])
Out[32]:
array([[0., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.]])
  • Drop the first level when you know perfectly collinear features will cause problems
  • For most models, dropping the first level won't improve performance
  • Dropping the first level is incompatible with ignoring unknown categories
  • Dropping the first level is likely problematic if you scale your features or use a regularized model

Paolo: What encoding should I use with an ordinal feature?

If you have an ordinal feature (categorical feature with a logical ordering) that is already encoded numerically (such as Pclass), then leave it as-is:

In [33]:
df
Out[33]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

If you have an ordinal feature that is encoded as strings, then use OrdinalEncoder:

  • You define the logical order of the categories
  • Each input feature becomes a single output feature (unlike OneHotEncoder)
In [34]:
df_ordinal = pd.DataFrame({'Class': ['third', 'first', 'second', 'third'],
                           'Size': ['S', 'S', 'L', 'XL']})
df_ordinal
Out[34]:
Class Size
0 third S
1 first S
2 second L
3 third XL
In [35]:
from sklearn.preprocessing import OrdinalEncoder
ore = OrdinalEncoder(categories=[['first', 'second', 'third'], ['S', 'M', 'L', 'XL']])
ore.fit_transform(df_ordinal)
Out[35]:
array([[2., 0.],
       [0., 0.],
       [1., 2.],
       [2., 3.]])

Vijey: What's the difference between OneHotEncoder, OrdinalEncoder, and LabelEncoder?

  • Use OneHotEncoder for unordered categorical features (nominal)
  • Use OrdinalEncoder for ordered categorical features (ordinal)
  • LabelEncoder is only for labels (meaning targets), and is rarely useful any more since scikit-learn classification models can handle string-based labels

Shreyas: In a ColumnTransformer, what are the other options for "remainder"?

"passthrough" means include all unspecified columns but don't modify them:

In [36]:
ohe = OneHotEncoder()
In [37]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
In [38]:
X.columns
Out[38]:
Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name'], dtype='object')
In [39]:
ct.fit_transform(X)
Out[39]:
<10x47 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>

"drop" means drop all unspecified columns:

In [40]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='drop')
In [41]:
ct.fit_transform(X)
Out[41]:
<10x45 sparse matrix of type '<class 'numpy.float64'>'
	with 66 stored elements in Compressed Sparse Row format>

You can also set remainder to be a transformer object, in which case all unspecified columns will be transformed.

Hause: How do I get the column names for the output of ColumnTransformer?

get_feature_names works in some cases:

In [42]:
ct.get_feature_names()
Out[42]:
['onehotencoder__x0_C',
 'onehotencoder__x0_Q',
 'onehotencoder__x0_S',
 'onehotencoder__x1_female',
 'onehotencoder__x1_male',
 'countvectorizer__achem',
 'countvectorizer__adele',
 'countvectorizer__allen',
 'countvectorizer__berg',
 'countvectorizer__bradley',
 'countvectorizer__braund',
 'countvectorizer__briggs',
 'countvectorizer__cumings',
 'countvectorizer__elisabeth',
 'countvectorizer__florence',
 'countvectorizer__futrelle',
 'countvectorizer__gosta',
 'countvectorizer__harris',
 'countvectorizer__heath',
 'countvectorizer__heikkinen',
 'countvectorizer__henry',
 'countvectorizer__jacques',
 'countvectorizer__james',
 'countvectorizer__john',
 'countvectorizer__johnson',
 'countvectorizer__laina',
 'countvectorizer__leonard',
 'countvectorizer__lily',
 'countvectorizer__master',
 'countvectorizer__may',
 'countvectorizer__mccarthy',
 'countvectorizer__miss',
 'countvectorizer__moran',
 'countvectorizer__mr',
 'countvectorizer__mrs',
 'countvectorizer__nasser',
 'countvectorizer__nicholas',
 'countvectorizer__oscar',
 'countvectorizer__owen',
 'countvectorizer__palsson',
 'countvectorizer__peel',
 'countvectorizer__thayer',
 'countvectorizer__timothy',
 'countvectorizer__vilhelmina',
 'countvectorizer__william']

get_feature_names will not (yet) work with a passthrough transformer:

In [43]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
In [44]:
ct.fit_transform(X)
Out[44]:
<10x47 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>
In [45]:
# ct.get_feature_names()

In that case, you will have to inspect the transformers one-by-one to figure out the column names:

In [46]:
ct.transformers_
Out[46]:
[('onehotencoder',
  OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
                handle_unknown='error', sparse=True),
  ['Embarked', 'Sex']),
 ('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                  lowercase=True, max_df=1.0, max_features=None, min_df=1,
                  ngram_range=(1, 1), preprocessor=None, stop_words=None,
                  strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, vocabulary=None),
  'Name'),
 ('remainder', 'passthrough', [0, 1])]
In [47]:
ct.named_transformers_.onehotencoder.get_feature_names()
Out[47]:
array(['x0_C', 'x0_Q', 'x0_S', 'x1_female', 'x1_male'], dtype=object)
In [48]:
ct.named_transformers_.countvectorizer.get_feature_names()
Out[48]:
['achem',
 'adele',
 'allen',
 'berg',
 'bradley',
 'braund',
 'briggs',
 'cumings',
 'elisabeth',
 'florence',
 'futrelle',
 'gosta',
 'harris',
 'heath',
 'heikkinen',
 'henry',
 'jacques',
 'james',
 'john',
 'johnson',
 'laina',
 'leonard',
 'lily',
 'master',
 'may',
 'mccarthy',
 'miss',
 'moran',
 'mr',
 'mrs',
 'nasser',
 'nicholas',
 'oscar',
 'owen',
 'palsson',
 'peel',
 'thayer',
 'timothy',
 'vilhelmina',
 'william']
In [49]:
X.columns
Out[49]:
Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name'], dtype='object')

Arjun: Is there a more efficient way to specify columns for a ColumnTransformer than listing them one-by-one?

You can specify columns by position or slice:

In [50]:
ct = make_column_transformer(
    (ohe, [2, 3]),
    (vect, 4),
    remainder='passthrough')
In [51]:
ct = make_column_transformer(
    (ohe, slice(2, 4)),
    (vect, 4),
    remainder='passthrough')

make_column_selector (new in version 0.22) allows you to select columns by regex pattern or data type:

In [52]:
from sklearn.compose import make_column_selector
cs = make_column_selector(pattern='E|S')
In [53]:
ct = make_column_transformer(
    (ohe, cs),
    (vect, 4),
    remainder='passthrough')

Tony: Does pipe.fit() modify the underlying objects (ct, logreg)?

Yes, it does modify the underlying objects:

In [54]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);
In [55]:
pipe.named_steps.logisticregression.coef_
Out[55]:
array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])
In [56]:
logreg.coef_
Out[56]:
array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

Arjun: Regarding the code "pipe.namedsteps.logisticregression.coef", can you explain what it is doing and why it references "logisticregression" rather than "logreg"?

This is a two-step pipeline:

In [57]:
pipe
Out[57]:
Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  <sklearn.compose._column_transformer.make_column_selector object...
                                                                  token_pattern='(?u)\\b\\w\\w+\\b',
                                                                  tokenizer=None,
                                                                  vocabulary=None),
                                                  4)],
                                   verbose=False)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=1,
                                    solver='liblinear', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

The step names were assigned by make_pipeline, and you can examine individual steps via the named_steps attribute:

In [58]:
pipe.named_steps.keys()
Out[58]:
dict_keys(['columntransformer', 'logisticregression'])
In [59]:
pipe.named_steps.columntransformer
Out[59]:
ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categories='auto', drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               sparse=True),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7fb95ade1b50>),
                                ('countvectorizer',
                                 CountVectorizer(analyzer='word', binary=False,
                                                 decode_error='strict',
                                                 dtype=<class 'numpy.int64'>,
                                                 encoding='utf-8',
                                                 input='content',
                                                 lowercase=True, max_df=1.0,
                                                 max_features=None, min_df=1,
                                                 ngram_range=(1, 1),
                                                 preprocessor=None,
                                                 stop_words=None,
                                                 strip_accents=None,
                                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                                 tokenizer=None,
                                                 vocabulary=None),
                                 4)],
                  verbose=False)
In [60]:
pipe.named_steps.logisticregression
Out[60]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
In [61]:
pipe.named_steps.logisticregression.coef_
Out[61]:
array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

Here are alternative ways to accomplish the same thing:

In [62]:
pipe.named_steps['logisticregression'].coef_
Out[62]:
array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])
In [63]:
pipe['logisticregression'].coef_
Out[63]:
array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])
In [64]:
pipe[1].coef_
Out[64]:
array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

Arjun: What's the difference between make_pipeline and Pipeline?

make_pipeline:

  • Assigns default step names (lowercase version of the step's class name)
  • Easier to read and write than Pipeline code
In [65]:
pipe = make_pipeline(ct, logreg)
pipe.named_steps.keys()
Out[65]:
dict_keys(['columntransformer', 'logisticregression'])

Pipeline:

  • Requires you to assign step names
  • Custom step names can be useful for clarity when you are doing a grid search
In [66]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
pipe.named_steps.keys()
Out[66]:
dict_keys(['preprocessor', 'classifier'])

Hause: Can you walk us through the documentation for Pipeline and ColumnTransformer?

Five pages/page types you need to be familiar with:

  1. API reference: high-level view
  2. Class documentation: detailed view of a class
  3. User guide: more examples and advice
  4. Examples: more complex examples
  5. Glossary: glossary of terms

Cathal: Is there a pandas method "pdpipe" that does something similar to scikit-learn's Pipeline?

  • pipe is a pandas method for including user-defined functions in a pandas method chain
  • pdpipe is a library for writing pandas code using an API that is similar to scikit-learn's Pipeline

Abla: Can you build feature interactions in a Pipeline?

Yes, using the PolynomialFeatures class, though I don't usually do so:

  • It doesn't scale well if you have lots of features
  • I prefer to use tree-based models that can learn feature interactions on their own

Charles: Why does CountVectorizer expect 1D input instead of 2D input?

Compare OneHotEncoder to CountVectorizer:

  • Most transformers (like OneHotEncoder) expect 2D input
  • CountVectorizer expects 1D input
In [67]:
ohe.fit_transform(X[['Embarked']])
Out[67]:
<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>
In [68]:
vect.fit_transform(X['Name'])
Out[68]:
<10x40 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>
In [69]:
ct = make_column_transformer(
    (ohe, ['Embarked']),
    (vect, 'Name'))
In [70]:
ct.fit_transform(X)
Out[70]:
<10x43 sparse matrix of type '<class 'numpy.float64'>'
	with 56 stored elements in Compressed Sparse Row format>

One possible reason: CountVectorizer isn't built to accept more than one column as input, thus it doesn't make sense for it to allow 2D input.

VK: How do I pass multiple columns to CountVectorizer?

Pass them in as two separate tuples:

In [71]:
ct = make_column_transformer(
    (vect, 'Name'),
    (vect, 'Sex'))
In [72]:
ct.fit_transform(X)
Out[72]:
<10x42 sparse matrix of type '<class 'numpy.longlong'>'
	with 56 stored elements in Compressed Sparse Row format>

make_column_transformer can't assign both of them the same name, so it appends numbers at the end:

In [73]:
ct.named_transformers_.keys()
Out[73]:
dict_keys(['countvectorizer-1', 'countvectorizer-2', 'remainder'])

Motasem: Would the document-term matrix have values greater than 1 if a word is repeated in a row?

Yes:

In [74]:
text = ['Machine Learning is fun', 'I am learning Machine Learning']
In [75]:
pd.DataFrame(vect.fit_transform(text).toarray(), columns=vect.get_feature_names())
Out[75]:
am fun is learning machine
0 0 1 1 1 1
1 1 0 0 2 1

Anna: What does "stored elements" mean in the sparse matrix?

Stored elements is the number of non-zero values:

In [76]:
dtm = vect.fit_transform(text)
dtm
Out[76]:
<2x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>
In [77]:
print(dtm)
  (0, 4)	1
  (0, 3)	1
  (0, 2)	1
  (0, 1)	1
  (1, 4)	1
  (1, 3)	2
  (1, 0)	1

Khaled: What happens if there are words in the testing set that didn't appear in the training set?

New words in the testing set will be ignored:

In [78]:
vect.get_feature_names()
Out[78]:
['am', 'fun', 'is', 'learning', 'machine']
In [79]:
vect.transform(['Data Science is FUN!']).toarray()
Out[79]:
array([[0, 1, 1, 0, 0]])

Anton: Once I've built a model (or pipeline) that I'm happy with, how can I save it so that I can use it later to make predictions?

Reset our pipeline:

In [80]:
ohe = OneHotEncoder()
vect = CountVectorizer()
In [81]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
In [82]:
logreg = LogisticRegression(solver='liblinear', random_state=1)
In [83]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);

You can save it to a file using pickle:

In [84]:
import pickle
In [85]:
with open('pipe.pickle', 'wb') as f:
    pickle.dump(pipe, f)
In [86]:
with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)
In [87]:
pipe_from_pickle.predict(X_new)
Out[87]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

You can save it to a file using joblib (which is more efficient than pickle for scikit-learn objects):

In [88]:
import joblib
In [89]:
joblib.dump(pipe, 'pipe.joblib')
Out[89]:
['pipe.joblib']
In [90]:
pipe_from_joblib = joblib.load('pipe.joblib')
In [91]:
pipe_from_joblib.predict(X_new)
Out[91]:
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

For both pickle and joblib objects:

  • You should only load it into an identical environment
  • You should only load objects you trust

Other alternatives are available that don't save the full model object, but do save a representation that can be used to make predictions:

Darnell: Will you be giving us homework or exercises in the course?

There won't be any in the Live Course, but there may be some exercises or walkthroughs in the Advanced Course.

Khaled: You mentioned that there will be an "Advanced Course" after this one. How can I register for it?

  • If you purchased the "Live Course + Advanced Course" bundle, you will get automatic access to the Advanced Course when it's released
  • If you did not purchase the bundle, please email me if you would like to purchase the upgrade