import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]
ohe = OneHotEncoder()
vect = CountVectorizer()
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
df
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
Selecting a Series versus a DataFrame:
df['Survived']
0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64
df[['Survived']]
|   | Survived |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 1 |
| 9 | 1 |
Series is 1D, DataFrame is 2D:
df['Survived'].shape
(10,)
df[['Survived']].shape
(10, 1)
Use to_numpy() to convert a pandas object to a NumPy array:
df['Survived'].to_numpy()
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])
df[['Survived']].to_numpy()
array([[0], [1], [1], [1], [0], [0], [0], [0], [1], [1]])
logreg = LogisticRegression(solver='liblinear', random_state=1)
Solver: 'liblinear' works well for small datasets like this one.
random_state: the liblinear solver has a random component, so setting random_state makes the results reproducible.
No, it's stored separately in the "intercept_" attribute.
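For example, after the pipeline has been fit, both attributes can be inspected on the logistic regression object (a minimal sketch; the actual values depend on the training data):
logreg.coef_       # 2D array of feature weights
logreg.intercept_  # 1D array holding the intercept, stored separately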
I prefer not to use super-specialized functions like LogisticRegressionCV, and instead use GridSearchCV which works with any model and integrates well into the scikit-learn workflow.
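As a sketch of that workflow (the parameter grid here is hypothetical), GridSearchCV can tune any step of the pipeline by using the step name as a prefix:
from sklearn.model_selection import GridSearchCV
params = {'logisticregression__C': [0.1, 1, 10]}  # hypothetical grid for the C parameter
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')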
Normally you would do cross-validation after each change, but in this case cross-validation scores would have been highly misleading due to the dataset size.
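For reference, cross-validating the full pipeline would look like this (a sketch only; with just 10 rows the scores would not be meaningful):
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()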
Yes, you fit the model to all samples for which you know the target value, otherwise you are throwing away useful training data.
Figure out what is important to you, and then choose an evaluation metric that matches those priorities.
Examples:
Recommended resources:
Use fit_transform on training data:
demo_train = pd.DataFrame({'letter':['A', 'B', 'C', 'B']})
demo_train
|   | letter |
|---|---|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | B |
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(demo_train[['letter']])
array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [0., 1., 0.]])
If you use fit_transform on testing data, it won't learn the same categories, which will be problematic:
demo_test = pd.DataFrame({'letter':['A', 'C', 'A']})
demo_test
|   | letter |
|---|---|
| 0 | A |
| 1 | C |
| 2 | A |
ohe.fit_transform(demo_test[['letter']])
array([[1., 0.], [0., 1.], [1., 0.]])
Always use fit_transform on training data and transform (only) on testing data so that categories will be represented the same way:
ohe.fit_transform(demo_train[['letter']])
array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [0., 1., 0.]])
ohe.transform(demo_test[['letter']])
array([[1., 0., 0.], [0., 0., 1.], [1., 0., 0.]])
If testing data contains a new category, the encoder will error during transformation:
demo_test_unknown = pd.DataFrame({'letter':['A', 'C', 'D']})
demo_test_unknown
|   | letter |
|---|---|
| 0 | A |
| 1 | C |
| 2 | D |
# ohe.transform(demo_test_unknown[['letter']])  # raises an error: 'D' was not seen during fit
The solution is to tell the encoder to represent unknown categories as all zeros:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit_transform(demo_train[['letter']])
array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [0., 1., 0.]])
ohe.transform(demo_test_unknown[['letter']])
array([[1., 0., 0.], [0., 0., 1.], [0., 0., 0.]])
Advice: if new data might contain categories that weren't seen during training, set handle_unknown='ignore' so that unknown categories are encoded as all zeros instead of raising an error.
Here is the default one-hot encoding:
demo_train
|   | letter |
|---|---|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | B |
ohe.fit_transform(demo_train[['letter']])
array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [0., 1., 0.]])
You can also drop the first level (new in version 0.21):
ohe = OneHotEncoder(sparse=False, drop='first')
ohe.fit_transform(demo_train[['letter']])
array([[0., 0.], [1., 0.], [0., 1.], [1., 0.]])
If you have an ordinal feature (categorical feature with a logical ordering) that is already encoded numerically (such as Pclass), then leave it as-is:
df
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
If you have an ordinal feature that is encoded as strings, then use OrdinalEncoder:
df_ordinal = pd.DataFrame({'Class': ['third', 'first', 'second', 'third'],
                           'Size': ['S', 'S', 'L', 'XL']})
df_ordinal
|   | Class | Size |
|---|---|---|
| 0 | third | S |
| 1 | first | S |
| 2 | second | L |
| 3 | third | XL |
from sklearn.preprocessing import OrdinalEncoder
ore = OrdinalEncoder(categories=[['first', 'second', 'third'], ['S', 'M', 'L', 'XL']])
ore.fit_transform(df_ordinal)
array([[2., 0.], [0., 0.], [1., 2.], [2., 3.]])
"passthrough" means include all unspecified columns but don't modify them:
ohe = OneHotEncoder()
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
X.columns
Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name'], dtype='object')
ct.fit_transform(X)
<10x47 sparse matrix of type '<class 'numpy.float64'>' with 78 stored elements in Compressed Sparse Row format>
"drop" means drop all unspecified columns:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='drop')
ct.fit_transform(X)
<10x45 sparse matrix of type '<class 'numpy.float64'>' with 66 stored elements in Compressed Sparse Row format>
You can also set remainder to be a transformer object, in which case all unspecified columns will be transformed.
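For example (a sketch, assuming you want to scale the remaining numeric columns, Parch and Fare):
from sklearn.preprocessing import MinMaxScaler
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder=MinMaxScaler())  # unspecified columns are scaled rather than passed through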
get_feature_names works in some cases:
ct.get_feature_names()
['onehotencoder__x0_C', 'onehotencoder__x0_Q', 'onehotencoder__x0_S', 'onehotencoder__x1_female', 'onehotencoder__x1_male', 'countvectorizer__achem', 'countvectorizer__adele', 'countvectorizer__allen', 'countvectorizer__berg', 'countvectorizer__bradley', 'countvectorizer__braund', 'countvectorizer__briggs', 'countvectorizer__cumings', 'countvectorizer__elisabeth', 'countvectorizer__florence', 'countvectorizer__futrelle', 'countvectorizer__gosta', 'countvectorizer__harris', 'countvectorizer__heath', 'countvectorizer__heikkinen', 'countvectorizer__henry', 'countvectorizer__jacques', 'countvectorizer__james', 'countvectorizer__john', 'countvectorizer__johnson', 'countvectorizer__laina', 'countvectorizer__leonard', 'countvectorizer__lily', 'countvectorizer__master', 'countvectorizer__may', 'countvectorizer__mccarthy', 'countvectorizer__miss', 'countvectorizer__moran', 'countvectorizer__mr', 'countvectorizer__mrs', 'countvectorizer__nasser', 'countvectorizer__nicholas', 'countvectorizer__oscar', 'countvectorizer__owen', 'countvectorizer__palsson', 'countvectorizer__peel', 'countvectorizer__thayer', 'countvectorizer__timothy', 'countvectorizer__vilhelmina', 'countvectorizer__william']
get_feature_names will not (yet) work with a passthrough transformer:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
ct.fit_transform(X)
<10x47 sparse matrix of type '<class 'numpy.float64'>' with 78 stored elements in Compressed Sparse Row format>
# ct.get_feature_names()  # raises an error because of the passthrough transformer
In that case, you will have to inspect the transformers one-by-one to figure out the column names:
ct.transformers_
[('onehotencoder', OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>, handle_unknown='error', sparse=True), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None), 'Name'), ('remainder', 'passthrough', [0, 1])]
ct.named_transformers_.onehotencoder.get_feature_names()
array(['x0_C', 'x0_Q', 'x0_S', 'x1_female', 'x1_male'], dtype=object)
ct.named_transformers_.countvectorizer.get_feature_names()
['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']
X.columns
Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name'], dtype='object')
You can specify columns by position or slice:
ct = make_column_transformer(
    (ohe, [2, 3]),
    (vect, 4),
    remainder='passthrough')
ct = make_column_transformer(
    (ohe, slice(2, 4)),
    (vect, 4),
    remainder='passthrough')
make_column_selector (new in version 0.22) allows you to select columns by regex pattern or data type:
from sklearn.compose import make_column_selector
cs = make_column_selector(pattern='E|S')
ct = make_column_transformer(
    (ohe, cs),
    (vect, 4),
    remainder='passthrough')
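make_column_selector can also select by data type (a sketch; in this dataset, dtype_include=object would match the Embarked, Sex, and Name columns):
cs_object = make_column_selector(dtype_include=object)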
Yes, it does modify the underlying objects:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);
pipe.named_steps.logisticregression.coef_
array([[ 0.18828769, -0.14100295, -0.16593861, 0.66504677, -0.78370063, 0.11596792, 0.11596792, -0.1262833 , 0.13845919, 0.07231978, -0.12539973, 0.07231978, 0.07231978, 0.13845919, 0.07231978, 0.10454614, -0.18913104, -0.12539973, 0.10454614, 0.23375375, -0.1262833 , 0.10454614, -0.14100295, 0.07231978, 0.13845919, 0.23375375, -0.18913104, 0.10454614, -0.18913104, 0.10454614, -0.20188362, 0.23375375, -0.14100295, -0.5945696 , 0.43129302, 0.11596792, 0.11596792, 0.13845919, -0.12539973, -0.18913104, 0.10454614, 0.07231978, -0.20188362, 0.13845919, -0.1262833 , 0.08778734, 0.01334678]])
logreg.coef_
array([[ 0.18828769, -0.14100295, -0.16593861, 0.66504677, -0.78370063, 0.11596792, 0.11596792, -0.1262833 , 0.13845919, 0.07231978, -0.12539973, 0.07231978, 0.07231978, 0.13845919, 0.07231978, 0.10454614, -0.18913104, -0.12539973, 0.10454614, 0.23375375, -0.1262833 , 0.10454614, -0.14100295, 0.07231978, 0.13845919, 0.23375375, -0.18913104, 0.10454614, -0.18913104, 0.10454614, -0.20188362, 0.23375375, -0.14100295, -0.5945696 , 0.43129302, 0.11596792, 0.11596792, 0.13845919, -0.12539973, -0.18913104, 0.10454614, 0.07231978, -0.20188362, 0.13845919, -0.1262833 , 0.08778734, 0.01334678]])
This is a two-step pipeline:
pipe
Pipeline(memory=None, steps=[('columntransformer', ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3, transformer_weights=None, transformers=[('onehotencoder', OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>, handle_unknown='error', sparse=True), <sklearn.compose._column_transformer.make_column_selector object... token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None), 4)], verbose=False)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=1, solver='liblinear', tol=0.0001, verbose=0, warm_start=False))], verbose=False)
The step names were assigned by make_pipeline, and you can examine individual steps via the named_steps attribute:
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
pipe.named_steps.columntransformer
ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3, transformer_weights=None, transformers=[('onehotencoder', OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>, handle_unknown='error', sparse=True), <sklearn.compose._column_transformer.make_column_selector object at 0x7fb95ade1b50>), ('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None), 4)], verbose=False)
pipe.named_steps.logisticregression
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=1, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
pipe.named_steps.logisticregression.coef_
array([[ 0.18828769, -0.14100295, -0.16593861, 0.66504677, -0.78370063, 0.11596792, 0.11596792, -0.1262833 , 0.13845919, 0.07231978, -0.12539973, 0.07231978, 0.07231978, 0.13845919, 0.07231978, 0.10454614, -0.18913104, -0.12539973, 0.10454614, 0.23375375, -0.1262833 , 0.10454614, -0.14100295, 0.07231978, 0.13845919, 0.23375375, -0.18913104, 0.10454614, -0.18913104, 0.10454614, -0.20188362, 0.23375375, -0.14100295, -0.5945696 , 0.43129302, 0.11596792, 0.11596792, 0.13845919, -0.12539973, -0.18913104, 0.10454614, 0.07231978, -0.20188362, 0.13845919, -0.1262833 , 0.08778734, 0.01334678]])
Here are alternative ways to accomplish the same thing:
pipe.named_steps['logisticregression'].coef_
array([[ 0.18828769, -0.14100295, -0.16593861, 0.66504677, -0.78370063, 0.11596792, 0.11596792, -0.1262833 , 0.13845919, 0.07231978, -0.12539973, 0.07231978, 0.07231978, 0.13845919, 0.07231978, 0.10454614, -0.18913104, -0.12539973, 0.10454614, 0.23375375, -0.1262833 , 0.10454614, -0.14100295, 0.07231978, 0.13845919, 0.23375375, -0.18913104, 0.10454614, -0.18913104, 0.10454614, -0.20188362, 0.23375375, -0.14100295, -0.5945696 , 0.43129302, 0.11596792, 0.11596792, 0.13845919, -0.12539973, -0.18913104, 0.10454614, 0.07231978, -0.20188362, 0.13845919, -0.1262833 , 0.08778734, 0.01334678]])
pipe['logisticregression'].coef_
array([[ 0.18828769, -0.14100295, -0.16593861, 0.66504677, -0.78370063, 0.11596792, 0.11596792, -0.1262833 , 0.13845919, 0.07231978, -0.12539973, 0.07231978, 0.07231978, 0.13845919, 0.07231978, 0.10454614, -0.18913104, -0.12539973, 0.10454614, 0.23375375, -0.1262833 , 0.10454614, -0.14100295, 0.07231978, 0.13845919, 0.23375375, -0.18913104, 0.10454614, -0.18913104, 0.10454614, -0.20188362, 0.23375375, -0.14100295, -0.5945696 , 0.43129302, 0.11596792, 0.11596792, 0.13845919, -0.12539973, -0.18913104, 0.10454614, 0.07231978, -0.20188362, 0.13845919, -0.1262833 , 0.08778734, 0.01334678]])
pipe[1].coef_
array([[ 0.18828769, -0.14100295, -0.16593861, 0.66504677, -0.78370063, 0.11596792, 0.11596792, -0.1262833 , 0.13845919, 0.07231978, -0.12539973, 0.07231978, 0.07231978, 0.13845919, 0.07231978, 0.10454614, -0.18913104, -0.12539973, 0.10454614, 0.23375375, -0.1262833 , 0.10454614, -0.14100295, 0.07231978, 0.13845919, 0.23375375, -0.18913104, 0.10454614, -0.18913104, 0.10454614, -0.20188362, 0.23375375, -0.14100295, -0.5945696 , 0.43129302, 0.11596792, 0.11596792, 0.13845919, -0.12539973, -0.18913104, 0.10454614, 0.07231978, -0.20188362, 0.13845919, -0.1262833 , 0.08778734, 0.01334678]])
make_pipeline:
pipe = make_pipeline(ct, logreg)
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
Pipeline:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
pipe.named_steps.keys()
dict_keys(['preprocessor', 'classifier'])
Five pages/page types you need to be familiar with:
Yes, using the PolynomialFeatures class, though I don't usually do so:
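A minimal sketch (assuming you want squared and interaction terms for the numeric columns):
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit_transform(X[['Parch', 'Fare']])  # adds Parch^2, Parch*Fare, and Fare^2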
Compare OneHotEncoder to CountVectorizer:
ohe.fit_transform(X[['Embarked']])
<10x3 sparse matrix of type '<class 'numpy.float64'>' with 10 stored elements in Compressed Sparse Row format>
vect.fit_transform(X['Name'])
<10x40 sparse matrix of type '<class 'numpy.int64'>' with 46 stored elements in Compressed Sparse Row format>
ct = make_column_transformer(
    (ohe, ['Embarked']),
    (vect, 'Name'))
ct.fit_transform(X)
<10x43 sparse matrix of type '<class 'numpy.float64'>' with 56 stored elements in Compressed Sparse Row format>
One possible reason: CountVectorizer is built to accept a single column of text as input, so it expects 1D input (a column name as a string) rather than 2D input (a list of column names).
Pass them in as two separate tuples:
ct = make_column_transformer(
    (vect, 'Name'),
    (vect, 'Sex'))
ct.fit_transform(X)
<10x42 sparse matrix of type '<class 'numpy.longlong'>' with 56 stored elements in Compressed Sparse Row format>
make_column_transformer can't assign both of them the same name, so it appends numbers at the end:
ct.named_transformers_.keys()
dict_keys(['countvectorizer-1', 'countvectorizer-2', 'remainder'])
Yes:
text = ['Machine Learning is fun', 'I am learning Machine Learning']
pd.DataFrame(vect.fit_transform(text).toarray(), columns=vect.get_feature_names())
|   | am | fun | is | learning | machine |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 2 | 1 |
Stored elements is the number of non-zero values:
dtm = vect.fit_transform(text)
dtm
<2x5 sparse matrix of type '<class 'numpy.int64'>' with 7 stored elements in Compressed Sparse Row format>
print(dtm)
  (0, 4)	1
  (0, 3)	1
  (0, 2)	1
  (0, 1)	1
  (1, 4)	1
  (1, 3)	2
  (1, 0)	1
New words in the testing set will be ignored:
vect.get_feature_names()
['am', 'fun', 'is', 'learning', 'machine']
vect.transform(['Data Science is FUN!']).toarray()
array([[0, 1, 1, 0, 0]])
Reset our pipeline:
ohe = OneHotEncoder()
vect = CountVectorizer()
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);
You can save it to a file using pickle:
import pickle
with open('pipe.pickle', 'wb') as f:
    pickle.dump(pipe, f)
with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)
pipe_from_pickle.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
You can save it to a file using joblib (which is more efficient than pickle for scikit-learn objects):
import joblib
joblib.dump(pipe, 'pipe.joblib')
['pipe.joblib']
pipe_from_joblib = joblib.load('pipe.joblib')
pipe_from_joblib.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
For both pickle and joblib objects: only load files that come from a trusted source (loading can execute arbitrary code), and load them into an environment with the same versions of scikit-learn and its dependencies that were used when saving.
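For example, you might record the library version alongside the saved file (a minimal sketch):
import sklearn
sklearn.__version__  # record this alongside pipe.pickle / pipe.joblib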
Other alternatives are available that don't save the full model object, but do save a representation that can be used to make predictions (for example, the ONNX and PMML formats).
There won't be any in the Live Course, but there may be some exercises or walkthroughs in the Advanced Course.