Check your scikit-learn version:
import sklearn
sklearn.__version__
'0.22.1'
Load 10 rows from the famous Titanic dataset:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
Basic terminology: each row is a sample (an observation), each column is a feature, and the column we want to predict ("Survived") is the target.
df
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
We want to use "Parch" and "Fare" as initial features:
Define X and y:
X = df[['Parch', 'Fare']]
X
| | Parch | Fare |
|---|---|---|
0 | 0 | 7.2500 |
1 | 0 | 71.2833 |
2 | 0 | 7.9250 |
3 | 0 | 53.1000 |
4 | 0 | 8.0500 |
5 | 0 | 8.4583 |
6 | 0 | 51.8625 |
7 | 1 | 21.0750 |
8 | 2 | 11.1333 |
9 | 0 | 30.0708 |
y = df['Survived']
y
0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Survived, dtype: int64
Check the object shapes:
X.shape
(10, 2)
y.shape
(10,)
Create a model object:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=1)
Evaluate the model using cross-validation:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()
0.6944444444444443
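Under the hood, cross_val_score splits the data into 3 folds (stratified, since this is a classifier), fits the model on 2 folds, scores it on the held-out fold, and reports the 3 scores. A minimal sketch of that loop, using make_classification as a synthetic stand-in for the Titanic features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for the 'Parch'/'Fare' features above
X, y = make_classification(n_samples=30, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
logreg = LogisticRegression(solver='liblinear', random_state=1)

# manual version of cross_val_score(logreg, X, y, cv=3, scoring='accuracy')
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    logreg.fit(X[train_idx], y[train_idx])                  # fit on 2 of the 3 folds
    scores.append(logreg.score(X[test_idx], y[test_idx]))   # score on the held-out fold

print(np.mean(scores))
print(cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean())  # same value
```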
Train the model on the entire dataset:
logreg.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=1, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
Read in a new dataset for which we don't know the target values:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
df_new
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | NaN | S |
6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q |
7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0000 | NaN | S |
8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C |
9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.1500 | NaN | S |
Define X_new to have the same columns as X:
X_new = df_new[['Parch', 'Fare']]
X_new
| | Parch | Fare |
|---|---|---|
0 | 0 | 7.8292 |
1 | 0 | 7.0000 |
2 | 0 | 9.6875 |
3 | 0 | 8.6625 |
4 | 1 | 12.2875 |
5 | 0 | 9.2250 |
6 | 0 | 7.6292 |
7 | 1 | 29.0000 |
8 | 0 | 7.2292 |
9 | 0 | 24.1500 |
Use the trained model to make predictions for X_new:
logreg.predict(X_new)
array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
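predict returns hard class labels. If you also want the estimated probabilities behind those labels, a fitted classifier exposes predict_proba. A quick sketch on a tiny stand-in frame (not the Titanic data; the fitted logreg above works the same way):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# tiny stand-in for the 'Parch'/'Fare' features above
X = pd.DataFrame({'Parch': [0, 0, 1, 2], 'Fare': [7.25, 71.28, 21.08, 11.13]})
y = pd.Series([0, 1, 0, 1])

logreg = LogisticRegression(solver='liblinear', random_state=1)
logreg.fit(X, y)

proba = logreg.predict_proba(X)   # shape (4, 2): column 0 is P(class 0), column 1 is P(class 1)
print(proba.shape)

# for a binary problem, predict is equivalent to thresholding P(class 1) at 0.5
print(logreg.predict(X))
print((proba[:, 1] > 0.5).astype(int))
```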
We want to use "Embarked" and "Sex" as additional features:
df
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
Encode "Embarked" using one-hot encoding:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(df[['Embarked']])
<10x3 sparse matrix of type '<class 'numpy.float64'>' with 10 stored elements in Compressed Sparse Row format>
Ask for a dense (not sparse) matrix so that we can examine the encoding (note: the sparse parameter was renamed to sparse_output in scikit-learn 1.2):
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['Embarked']])
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])
ohe.categories_
[array(['C', 'Q', 'S'], dtype=object)]
What's the difference between "fit" and "transform"?
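In short: "fit" learns something from the data (here, the sorted list of categories), and "transform" uses what was learned to produce the encoded output. Crucially, transform can then be applied to new data using the categories learned from the original data. A minimal sketch on a tiny illustrative frame (not the Titanic data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})
new = pd.DataFrame({'Embarked': ['C', 'C']})

ohe = OneHotEncoder()
ohe.fit(train[['Embarked']])      # "fit" learns the categories from the training data
print(ohe.categories_)            # [array(['C', 'Q', 'S'], dtype=object)]

# "transform" applies the learned encoding, here to new data
print(ohe.transform(new[['Embarked']]).toarray())
# [[1. 0. 0.]
#  [1. 0. 0.]]
```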
Encode "Embarked" and "Sex" at the same time:
ohe.fit_transform(df[['Embarked', 'Sex']])
array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.]])
ohe.categories_
[array(['C', 'Q', 'S'], dtype=object), array(['female', 'male'], dtype=object)]
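One caveat worth knowing: with the default handle_unknown='error', a fitted OneHotEncoder refuses to transform a category it did not see during fit. Passing handle_unknown='ignore' instead encodes unseen categories as all zeros. A small illustration (the category 'Z' is made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()  # handle_unknown='error' is the default
ohe.fit(pd.DataFrame({'Embarked': ['S', 'C', 'Q']}))

try:
    ohe.transform(pd.DataFrame({'Embarked': ['Z']}))  # 'Z' was never seen during fit
except ValueError as e:
    print('ValueError:', e)

# handle_unknown='ignore' encodes unseen categories as all zeros instead
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(pd.DataFrame({'Embarked': ['S', 'C', 'Q']}))
print(ohe.transform(pd.DataFrame({'Embarked': ['Z']})).toarray())  # [[0. 0. 0.]]
```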
How could we include "Embarked" and "Sex" in the model along with "Parch" and "Fare"?
Goals: apply the one-hot encoding to "Embarked" and "Sex", pass "Parch" and "Fare" through unmodified, and combine it all into a single object that can be reused on new data.
Create a list of columns and use that to update X:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']
X = df[cols]
X
| | Parch | Fare | Embarked | Sex |
|---|---|---|---|---|
0 | 0 | 7.2500 | S | male |
1 | 0 | 71.2833 | C | female |
2 | 0 | 7.9250 | S | female |
3 | 0 | 53.1000 | S | female |
4 | 0 | 8.0500 | S | male |
5 | 0 | 8.4583 | Q | male |
6 | 0 | 51.8625 | S | male |
7 | 1 | 21.0750 | S | male |
8 | 2 | 11.1333 | S | female |
9 | 0 | 30.0708 | C | female |
Create an instance of OneHotEncoder with the default options:
ohe = OneHotEncoder()
Create a ColumnTransformer:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
(ohe, ['Embarked', 'Sex']),
remainder='passthrough')
Perform the transformation:
ct.fit_transform(X)
array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])
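Notice the column order: the encoded columns come first, in transformer order (three for "Embarked", then two for "Sex"), and the remainder='passthrough' columns ("Parch", "Fare") are appended at the end. A sketch of the same result assembled by hand, on a small stand-in frame:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# small stand-in for X above
X = pd.DataFrame({'Parch': [0, 1, 0], 'Fare': [7.25, 71.28, 8.46],
                  'Embarked': ['S', 'C', 'Q'], 'Sex': ['male', 'female', 'male']})

ct = make_column_transformer((OneHotEncoder(), ['Embarked', 'Sex']),
                             remainder='passthrough')
out = ct.fit_transform(X)
out = out.toarray() if hasattr(out, 'toarray') else out  # output may be sparse

# same result by hand: encoded columns first, then the passthrough columns
enc = OneHotEncoder().fit_transform(X[['Embarked', 'Sex']]).toarray()
manual = np.column_stack([enc, X[['Parch', 'Fare']].to_numpy()])
print(np.allclose(out, manual))  # True
```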
Use Pipeline to chain together sequential steps:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ct, logreg)
Fit the Pipeline:
pipe.fit(X, y)
Pipeline(memory=None, steps=[('columntransformer', ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3, transformer_weights=None, transformers=[('onehotencoder', OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>, handle_unknown='error', sparse=True), ['Embarked', 'Sex'])], verbose=False)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=1, solver='liblinear', tol=0.0001, verbose=0, warm_start=False))], verbose=False)
This is what happens "under the hood" when you fit the Pipeline:
logreg.fit(ct.fit_transform(X), y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=1, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
You can select the steps of a Pipeline by name in order to inspect them:
pipe.named_steps.logisticregression.coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293, 0.20056557, 0.01597307]])
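Those seven coefficients line up with the seven transformed columns: the one-hot levels (in the order reported by categories_) followed by the passthrough columns. A sketch that pairs them up by hand on a small stand-in dataset (the names like 'Embarked_C' are assembled manually for illustration):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# tiny stand-in for X and y above
X = pd.DataFrame({'Parch': [0, 0, 1, 2], 'Fare': [7.2, 71.3, 21.1, 11.1],
                  'Embarked': ['S', 'C', 'Q', 'S'],
                  'Sex': ['male', 'female', 'male', 'female']})
y = pd.Series([0, 1, 0, 1])

ct = make_column_transformer((OneHotEncoder(), ['Embarked', 'Sex']),
                             remainder='passthrough')
pipe = make_pipeline(ct, LogisticRegression(solver='liblinear', random_state=1))
pipe.fit(X, y)

# one-hot levels first (Embarked, then Sex), passthrough columns last
names = ['Embarked_C', 'Embarked_Q', 'Embarked_S',
         'Sex_female', 'Sex_male', 'Parch', 'Fare']
coefs = pipe.named_steps.logisticregression.coef_[0]
for name, coef in zip(names, coefs):
    print(f'{name:11} {coef: .3f}')
```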
Update X_new to have the same columns as X:
X_new = df_new[cols]
X_new
| | Parch | Fare | Embarked | Sex |
|---|---|---|---|---|
0 | 0 | 7.8292 | Q | male |
1 | 0 | 7.0000 | S | female |
2 | 0 | 9.6875 | Q | male |
3 | 0 | 8.6625 | S | male |
4 | 1 | 12.2875 | S | female |
5 | 0 | 9.2250 | S | male |
6 | 0 | 7.6292 | Q | female |
7 | 1 | 29.0000 | S | male |
8 | 0 | 7.2292 | C | female |
9 | 0 | 24.1500 | S | male |
Use the fitted Pipeline to make predictions for X_new:
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
This is what happens "under the hood" when you make predictions using the Pipeline:
logreg.predict(ct.transform(X_new))
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
This is all of the code that is necessary to recreate our workflow up to this point:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
cols = ['Parch', 'Fare', 'Embarked', 'Sex']
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]
ohe = OneHotEncoder()
ct = make_column_transformer(
(ohe, ['Embarked', 'Sex']),
remainder='passthrough')
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
Summary of our ColumnTransformer: it one-hot encodes "Embarked" and "Sex" and, because of remainder='passthrough', passes "Parch" and "Fare" through unmodified.
Summary of our Pipeline: it chains the ColumnTransformer with the LogisticRegression model, so a single fit (or predict) call runs the entire workflow.
Comparing Pipeline and ColumnTransformer: a ColumnTransformer applies different transformations to different columns in parallel and concatenates the results side-by-side, whereas a Pipeline applies its steps in sequence, feeding the output of each step into the next.
We want to use "Name" as an additional feature:
df
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
Use CountVectorizer to convert text into a matrix of token counts:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
dtm = vect.fit_transform(df['Name'])
dtm
<10x40 sparse matrix of type '<class 'numpy.int64'>' with 46 stored elements in Compressed Sparse Row format>
Examine the feature names:
print(vect.get_feature_names())
['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']
Examine the document-term matrix as a DataFrame:
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
| | achem | adele | allen | berg | bradley | braund | briggs | cumings | elisabeth | florence | ... | nasser | nicholas | oscar | owen | palsson | peel | thayer | timothy | vilhelmina | william |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
9 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 40 columns
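As with OneHotEncoder, the vocabulary is learned during fit; transform then counts only those learned words and silently ignores any word that was not in the fitted vocabulary. A minimal sketch using two names from the training data and one from the new data:

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])
print(sorted(vect.vocabulary_))
# ['braund', 'harris', 'heikkinen', 'laina', 'miss', 'mr', 'owen']

# 'kelly' and 'james' are not in the vocabulary, so only 'mr' is counted
print(vect.transform(['Kelly, Mr. James']).toarray())
# [[0 0 0 0 0 1 0]]
```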
Update X to include the "Name" column:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]
X
| | Parch | Fare | Embarked | Sex | Name |
|---|---|---|---|---|---|
0 | 0 | 7.2500 | S | male | Braund, Mr. Owen Harris |
1 | 0 | 71.2833 | C | female | Cumings, Mrs. John Bradley (Florence Briggs Th... |
2 | 0 | 7.9250 | S | female | Heikkinen, Miss. Laina |
3 | 0 | 53.1000 | S | female | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
4 | 0 | 8.0500 | S | male | Allen, Mr. William Henry |
5 | 0 | 8.4583 | Q | male | Moran, Mr. James |
6 | 0 | 51.8625 | S | male | McCarthy, Mr. Timothy J |
7 | 1 | 21.0750 | S | male | Palsson, Master. Gosta Leonard |
8 | 2 | 11.1333 | S | female | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) |
9 | 0 | 30.0708 | C | female | Nasser, Mrs. Nicholas (Adele Achem) |
Update the ColumnTransformer:
ct = make_column_transformer(
(ohe, ['Embarked', 'Sex']),
(vect, 'Name'),
remainder='passthrough')
Perform the transformation (47 columns: 3 for "Embarked", 2 for "Sex", 40 for "Name", and 2 passed through):
ct.fit_transform(X)
<10x47 sparse matrix of type '<class 'numpy.float64'>' with 78 stored elements in Compressed Sparse Row format>
Update the Pipeline to contain the modified ColumnTransformer:
pipe = make_pipeline(ct, logreg)
Fit the Pipeline and examine the steps:
pipe.fit(X, y)
pipe.named_steps
{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3, transformer_weights=None, transformers=[('onehotencoder', OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>, handle_unknown='error', sparse=True), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None), 'Name')], verbose=False), 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=1, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)}
Update X_new to include the "Name" column:
X_new = df_new[cols]
Use the fitted Pipeline to make predictions for X_new:
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
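A final note: because the Pipeline bundles preprocessing and model into one object, you can cross-validate the entire workflow; the preprocessing is then re-fit inside each training fold, which avoids leaking information from the validation fold. A sketch on a small stand-in dataset (note the hedge handle_unknown='ignore', since a validation fold may contain a category its training folds never saw):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# tiny stand-in for X and y above
X = pd.DataFrame({
    'Parch': [0, 0, 1, 2, 0, 0],
    'Fare': [7.25, 71.28, 21.08, 11.13, 30.07, 8.05],
    'Embarked': ['S', 'C', 'S', 'S', 'C', 'S'],
    'Sex': ['male', 'female', 'male', 'female', 'female', 'male'],
    'Name': ['Braund, Mr. Owen', 'Cumings, Mrs. John', 'Palsson, Master. Gosta',
             'Johnson, Mrs. Oscar', 'Nasser, Mrs. Nicholas', 'Allen, Mr. William'],
})
y = pd.Series([0, 1, 0, 1, 1, 0])

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Embarked', 'Sex']),
    (CountVectorizer(), 'Name'),
    remainder='passthrough')
pipe = make_pipeline(ct, LogisticRegression(solver='liblinear', random_state=1))

# the encoder and vectorizer are re-fit within each training fold
print(cross_val_score(pipe, X, y, cv=3, scoring='accuracy').mean())
```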