Office Hours session 1¶

Recap of Lesson 1 (copy from here: http://bit.ly/first-ml-lesson)¶

In [1]:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [2]:

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']

In [3]:

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']

In [4]:

df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]

In [5]:

ohe = OneHotEncoder()
vect = CountVectorizer()

In [6]:

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')

In [7]:

logreg = LogisticRegression(solver='liblinear', random_state=1)

In [8]:

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

Out[8]:

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

Anna: Why is it important to select a Series instead of a DataFrame for the target variable?¶

In [9]:

df

Out[9]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

Selecting a Series versus a DataFrame:

In [10]:

df['Survived']

Out[10]:

0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64

In [11]:

df[['Survived']]

Out[11]:

	Survived
0	0
1	1
2	1
3	1
4	0
5	0
6	0
7	0
8	1
9	1

Series is 1D, DataFrame is 2D:

In [12]:

df['Survived'].shape

Out[12]:

(10,)

In [13]:

df[['Survived']].shape

Out[13]:

(10, 1)

Use to_numpy() to convert a pandas object to a NumPy array:

In [14]:

df['Survived'].to_numpy()

Out[14]:

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])

In [15]:

df[['Survived']].to_numpy()

Out[15]:

array([[0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1]])

1D target is used for classification problems with a single label
2D target is used for multilabel classification problems

Shashi: Can you explain the solver and random_state parameters of logistic regression?¶

In [16]:

logreg = LogisticRegression(solver='liblinear', random_state=1)

Solver:

Calculates the coefficients
Solvers have different strengths and weaknesses, see comparison here: Logistic regression section of scikit-learn User Guide
If you get a convergence warning, try changing the solver

random_state:

Use it to ensure reproducibility any time you have a pseudo-random process
Set it to any integer

Hause: Is the intercept included in the list of logistic regression coefficients?¶

No, it's stored separately in the "intercept_" attribute.

VK: Why did you use LogisticRegression instead of LogisticRegressionCV, which has built-in cross-validation?¶

I prefer not to use super-specialized functions like LogisticRegressionCV, and instead use GridSearchCV which works with any model and integrates well into the scikit-learn workflow.

Arjun: Is there a reason that you didn't do cross-validation after each change that you made to your model?¶

Normally you would do cross-validation after each change, but in this case cross-validation scores would have been highly misleading due to the dataset size.

Anton: Do you always fit the model on the entire dataset after setting the hyperparameters via cross-validation?¶

Yes, you fit the model to all samples for which you know the target value, otherwise you are throwing away useful training data.

Rachel: How do I choose the right "scoring" parameter?¶

Figure out what is important to you, and then choose an evaluation metric that matches those priorities.

Examples:

spam filter: optimize for precision
fraud detector: optimize for recall

Recommended resources:

Arjun: When one-hot encoding, what happens if the testing data has a new category that was not in the training data?¶

Use fit_transform on training data:

In [17]:

demo_train = pd.DataFrame({'letter':['A', 'B', 'C', 'B']})
demo_train

Out[17]:

	letter
0	A
1	B
2	C
3	B

In [18]:

ohe = OneHotEncoder(sparse=False)

In [19]:

ohe.fit_transform(demo_train[['letter']])

Out[19]:

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

If you use fit_transform on testing data, it won't learn the same categories, which will be problematic:

In [20]:

demo_test = pd.DataFrame({'letter':['A', 'C', 'A']})
demo_test

Out[20]:

	letter
0	A
1	C
2	A

In [21]:

ohe.fit_transform(demo_test[['letter']])

Out[21]:

array([[1., 0.],
       [0., 1.],
       [1., 0.]])

Always use fit_transform on training data and transform (only) on testing data so that categories will be represented the same way:

In [22]:

ohe.fit_transform(demo_train[['letter']])

Out[22]:

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [23]:

ohe.transform(demo_test[['letter']])

Out[23]:

array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

If testing data contains a new category, the encoder will error during transformation:

In [24]:

demo_test_unknown = pd.DataFrame({'letter':['A', 'C', 'D']})
demo_test_unknown

Out[24]:

	letter
0	A
1	C
2	D

In [25]:

# ohe.transform(demo_test_unknown[['letter']])

The solution is to tell the encoder to represent unknown categories as all zeros:

In [26]:

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

In [27]:

ohe.fit_transform(demo_train[['letter']])

Out[27]:

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [28]:

ohe.transform(demo_test_unknown[['letter']])

Out[28]:

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])

Advice:

By default, keep handle_unknown='error' so you know if you are encountering new categories
If you encounter new categories, set handle_unknown='ignore' but keep in mind that all unknown categories will be encoded the same way
As soon as possible, retrain your model with data that includes any new categories

Chris: Should we drop one of the one-hot encoded features, since some models (such as linear regression) don't like when there is collinearity between features?¶

Here is the default one-hot encoding:

In [29]:

demo_train

Out[29]:

	letter
0	A
1	B
2	C
3	B

In [30]:

ohe.fit_transform(demo_train[['letter']])

Out[30]:

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

You can also drop the first level (new in version 0.21):

In [31]:

ohe = OneHotEncoder(sparse=False, drop='first')

In [32]:

ohe.fit_transform(demo_train[['letter']])

Out[32]:

array([[0., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.]])

Drop the first level when you know perfectly collinear features will cause problems
For most models, dropping the first level won't improve performance
Dropping the first level is incompatible with ignoring unknown categories
Dropping the first level is likely problematic if you scale your features or use a regularized model

Paolo: What encoding should I use with an ordinal feature?¶

If you have an ordinal feature (categorical feature with a logical ordering) that is already encoded numerically (such as Pclass), then leave it as-is:

In [33]:

df

Out[33]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

If you have an ordinal feature that is encoded as strings, then use OrdinalEncoder:

You define the logical order of the categories
Each input feature becomes a single output feature (unlike OneHotEncoder)

In [34]:

df_ordinal = pd.DataFrame({'Class': ['third', 'first', 'second', 'third'],
                           'Size': ['S', 'S', 'L', 'XL']})
df_ordinal

Out[34]:

	Class	Size
0	third	S
1	first	S
2	second	L
3	third	XL

In [35]:

from sklearn.preprocessing import OrdinalEncoder
ore = OrdinalEncoder(categories=[['first', 'second', 'third'], ['S', 'M', 'L', 'XL']])
ore.fit_transform(df_ordinal)

Out[35]:

array([[2., 0.],
       [0., 0.],
       [1., 2.],
       [2., 3.]])

Vijey: What's the difference between OneHotEncoder, OrdinalEncoder, and LabelEncoder?¶

Use OneHotEncoder for unordered categorical features (nominal)
Use OrdinalEncoder for ordered categorical features (ordinal)
LabelEncoder is only for labels (meaning targets), and is rarely useful any more since scikit-learn classification models can handle string-based labels

Shreyas: In a ColumnTransformer, what are the other options for "remainder"?¶

"passthrough" means include all unspecified columns but don't modify them:

In [36]:

ohe = OneHotEncoder()

In [37]:

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')

In [38]:

X.columns

Out[38]:

Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name'], dtype='object')

In [39]:

ct.fit_transform(X)

Out[39]:

<10x47 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>

"drop" means drop all unspecified columns:

In [40]:

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='drop')

In [41]:

ct.fit_transform(X)

Out[41]:

<10x45 sparse matrix of type '<class 'numpy.float64'>'
	with 66 stored elements in Compressed Sparse Row format>

You can also set remainder to be a transformer object, in which case all unspecified columns will be transformed.

Hause: How do I get the column names for the output of ColumnTransformer?¶

get_feature_names works in some cases:

In [42]:

ct.get_feature_names()

Out[42]:

['onehotencoder__x0_C',
 'onehotencoder__x0_Q',
 'onehotencoder__x0_S',
 'onehotencoder__x1_female',
 'onehotencoder__x1_male',
 'countvectorizer__achem',
 'countvectorizer__adele',
 'countvectorizer__allen',
 'countvectorizer__berg',
 'countvectorizer__bradley',
 'countvectorizer__braund',
 'countvectorizer__briggs',
 'countvectorizer__cumings',
 'countvectorizer__elisabeth',
 'countvectorizer__florence',
 'countvectorizer__futrelle',
 'countvectorizer__gosta',
 'countvectorizer__harris',
 'countvectorizer__heath',
 'countvectorizer__heikkinen',
 'countvectorizer__henry',
 'countvectorizer__jacques',
 'countvectorizer__james',
 'countvectorizer__john',
 'countvectorizer__johnson',
 'countvectorizer__laina',
 'countvectorizer__leonard',
 'countvectorizer__lily',
 'countvectorizer__master',
 'countvectorizer__may',
 'countvectorizer__mccarthy',
 'countvectorizer__miss',
 'countvectorizer__moran',
 'countvectorizer__mr',
 'countvectorizer__mrs',
 'countvectorizer__nasser',
 'countvectorizer__nicholas',
 'countvectorizer__oscar',
 'countvectorizer__owen',
 'countvectorizer__palsson',
 'countvectorizer__peel',
 'countvectorizer__thayer',
 'countvectorizer__timothy',
 'countvectorizer__vilhelmina',
 'countvectorizer__william']

get_feature_names will not (yet) work with a passthrough transformer:

In [43]:

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')

In [44]:

ct.fit_transform(X)

Out[44]:

<10x47 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>

In [45]:

# ct.get_feature_names()

In that case, you will have to inspect the transformers one-by-one to figure out the column names:

In [46]:

ct.transformers_

Out[46]:

[('onehotencoder',
  OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
                handle_unknown='error', sparse=True),
  ['Embarked', 'Sex']),
 ('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                  lowercase=True, max_df=1.0, max_features=None, min_df=1,
                  ngram_range=(1, 1), preprocessor=None, stop_words=None,
                  strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, vocabulary=None),
  'Name'),
 ('remainder', 'passthrough', [0, 1])]

In [47]:

ct.named_transformers_.onehotencoder.get_feature_names()

Out[47]:

array(['x0_C', 'x0_Q', 'x0_S', 'x1_female', 'x1_male'], dtype=object)

In [48]:

ct.named_transformers_.countvectorizer.get_feature_names()

Out[48]:

['achem',
 'adele',
 'allen',
 'berg',
 'bradley',
 'braund',
 'briggs',
 'cumings',
 'elisabeth',
 'florence',
 'futrelle',
 'gosta',
 'harris',
 'heath',
 'heikkinen',
 'henry',
 'jacques',
 'james',
 'john',
 'johnson',
 'laina',
 'leonard',
 'lily',
 'master',
 'may',
 'mccarthy',
 'miss',
 'moran',
 'mr',
 'mrs',
 'nasser',
 'nicholas',
 'oscar',
 'owen',
 'palsson',
 'peel',
 'thayer',
 'timothy',
 'vilhelmina',
 'william']

In [49]:

X.columns

Out[49]:

Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name'], dtype='object')

Arjun: Is there a more efficient way to specify columns for a ColumnTransformer than listing them one-by-one?¶

You can specify columns by position or slice:

In [50]:

ct = make_column_transformer(
    (ohe, [2, 3]),
    (vect, 4),
    remainder='passthrough')

In [51]:

ct = make_column_transformer(
    (ohe, slice(2, 4)),
    (vect, 4),
    remainder='passthrough')

make_column_selector (new in version 0.22) allows you to select columns by regex pattern or data type:

In [52]:

from sklearn.compose import make_column_selector
cs = make_column_selector(pattern='E|S')

In [53]:

ct = make_column_transformer(
    (ohe, cs),
    (vect, 4),
    remainder='passthrough')

Tony: Does pipe.fit() modify the underlying objects (ct, logreg)?¶

Yes, it does modify the underlying objects:

In [54]:

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);

In [55]:

pipe.named_steps.logisticregression.coef_

Out[55]:

array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

In [56]:

logreg.coef_

Out[56]:

array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

Arjun: Regarding the code "pipe.named_steps.logisticregression.coef_", can you explain what it is doing and why it references "logisticregression" rather than "logreg"?¶

This is a two-step pipeline:

In [57]:

pipe

Out[57]:

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  <sklearn.compose._column_transformer.make_column_selector object...
                                                                  token_pattern='(?u)\\b\\w\\w+\\b',
                                                                  tokenizer=None,
                                                                  vocabulary=None),
                                                  4)],
                                   verbose=False)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=1,
                                    solver='liblinear', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

The step names were assigned by make_pipeline, and you can examine individual steps via the named_steps attribute:

In [58]:

pipe.named_steps.keys()

Out[58]:

dict_keys(['columntransformer', 'logisticregression'])

In [59]:

pipe.named_steps.columntransformer

Out[59]:

ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('onehotencoder',
                                 OneHotEncoder(categories='auto', drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               sparse=True),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7fb95ade1b50>),
                                ('countvectorizer',
                                 CountVectorizer(analyzer='word', binary=False,
                                                 decode_error='strict',
                                                 dtype=<class 'numpy.int64'>,
                                                 encoding='utf-8',
                                                 input='content',
                                                 lowercase=True, max_df=1.0,
                                                 max_features=None, min_df=1,
                                                 ngram_range=(1, 1),
                                                 preprocessor=None,
                                                 stop_words=None,
                                                 strip_accents=None,
                                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                                 tokenizer=None,
                                                 vocabulary=None),
                                 4)],
                  verbose=False)

In [60]:

pipe.named_steps.logisticregression

Out[60]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [61]:

pipe.named_steps.logisticregression.coef_

Out[61]:

array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

Here are alternative ways to accomplish the same thing:

In [62]:

pipe.named_steps['logisticregression'].coef_

Out[62]:

array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

In [63]:

pipe['logisticregression'].coef_

Out[63]:

array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

In [64]:

pipe[1].coef_

Out[64]:

array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

Arjun: What's the difference between make_pipeline and Pipeline?¶

make_pipeline:

Assigns default step names (lowercase version of the step's class name)
Easier to read and write than Pipeline code

In [65]:

pipe = make_pipeline(ct, logreg)
pipe.named_steps.keys()

Out[65]:

dict_keys(['columntransformer', 'logisticregression'])

Pipeline:

Requires you to assign step names
Custom step names can be useful for clarity when you are doing a grid search

In [66]:

from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
pipe.named_steps.keys()

Out[66]:

dict_keys(['preprocessor', 'classifier'])

Hause: Can you walk us through the documentation for Pipeline and ColumnTransformer?¶

Five pages/page types you need to be familiar with:

API reference: high-level view
Class documentation: detailed view of a class
User guide: more examples and advice
Examples: more complex examples
Glossary: glossary of terms

Cathal: Is there a pandas method "pdpipe" that does something similar to scikit-learn's Pipeline?¶

pipe is a pandas method for including user-defined functions in a pandas method chain
pdpipe is a library for writing pandas code using an API that is similar to scikit-learn's Pipeline

Abla: Can you build feature interactions in a Pipeline?¶

Yes, using the PolynomialFeatures class, though I don't usually do so:

It doesn't scale well if you have lots of features
I prefer to use tree-based models that can learn feature interactions on their own

Charles: Why does CountVectorizer expect 1D input instead of 2D input?¶

Compare OneHotEncoder to CountVectorizer:

Most transformers (like OneHotEncoder) expect 2D input
CountVectorizer expects 1D input

In [67]:

ohe.fit_transform(X[['Embarked']])

Out[67]:

<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [68]:

vect.fit_transform(X['Name'])

Out[68]:

<10x40 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>

In [69]:

ct = make_column_transformer(
    (ohe, ['Embarked']),
    (vect, 'Name'))

In [70]:

ct.fit_transform(X)

Out[70]:

<10x43 sparse matrix of type '<class 'numpy.float64'>'
	with 56 stored elements in Compressed Sparse Row format>

One possible reason: CountVectorizer isn't built to accept more than one column as input, thus it doesn't make sense for it to allow 2D input.

VK: How do I pass multiple columns to CountVectorizer?¶

Pass them in as two separate tuples:

In [71]:

ct = make_column_transformer(
    (vect, 'Name'),
    (vect, 'Sex'))

In [72]:

ct.fit_transform(X)

Out[72]:

<10x42 sparse matrix of type '<class 'numpy.longlong'>'
	with 56 stored elements in Compressed Sparse Row format>

make_column_transformer can't assign both of them the same name, so it appends numbers at the end:

In [73]:

ct.named_transformers_.keys()

Out[73]:

dict_keys(['countvectorizer-1', 'countvectorizer-2', 'remainder'])

Motasem: Would the document-term matrix have values greater than 1 if a word is repeated in a row?¶

Yes:

In [74]:

text = ['Machine Learning is fun', 'I am learning Machine Learning']

In [75]:

pd.DataFrame(vect.fit_transform(text).toarray(), columns=vect.get_feature_names())

Out[75]:

	am	fun	is	learning	machine
0	0	1	1	1	1
1	1	0	0	2	1

Anna: What does "stored elements" mean in the sparse matrix?¶

Stored elements is the number of non-zero values:

In [76]:

dtm = vect.fit_transform(text)
dtm

Out[76]:

<2x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [77]:

print(dtm)

  (0, 4)	1
  (0, 3)	1
  (0, 2)	1
  (0, 1)	1
  (1, 4)	1
  (1, 3)	2
  (1, 0)	1

Khaled: What happens if there are words in the testing set that didn't appear in the training set?¶

New words in the testing set will be ignored:

In [78]:

vect.get_feature_names()

Out[78]:

['am', 'fun', 'is', 'learning', 'machine']

In [79]:

vect.transform(['Data Science is FUN!']).toarray()

Out[79]:

array([[0, 1, 1, 0, 0]])

Anton: Once I've built a model (or pipeline) that I'm happy with, how can I save it so that I can use it later to make predictions?¶

Reset our pipeline:

In [80]:

ohe = OneHotEncoder()
vect = CountVectorizer()

In [81]:

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')

In [82]:

logreg = LogisticRegression(solver='liblinear', random_state=1)

In [83]:

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);

You can save it to a file using pickle:

In [84]:

import pickle

In [85]:

with open('pipe.pickle', 'wb') as f:
    pickle.dump(pipe, f)

In [86]:

with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)

In [87]:

pipe_from_pickle.predict(X_new)

Out[87]:

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

You can save it to a file using joblib (which is more efficient than pickle for scikit-learn objects):

In [88]:

import joblib

In [89]:

joblib.dump(pipe, 'pipe.joblib')

Out[89]:

['pipe.joblib']

In [90]:

pipe_from_joblib = joblib.load('pipe.joblib')

In [91]:

pipe_from_joblib.predict(X_new)

Out[91]:

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

For both pickle and joblib objects:

You should only load it into an identical environment
You should only load objects you trust

Other alternatives are available that don't save the full model object, but do save a representation that can be used to make predictions:

See the "Model export for production" section on this page: scikit-learn Related Projects

Darnell: Will you be giving us homework or exercises in the course?¶

There won't be any in the Live Course, but there may be some exercises or walkthroughs in the Advanced Course.

Khaled: You mentioned that there will be an "Advanced Course" after this one. How can I register for it?¶

If you purchased the "Live Course + Advanced Course" bundle, you will get automatic access to the Advanced Course when it's released
If you did not purchase the bundle, please email me if you would like to purchase the upgrade