Check your scikit-learn version:
import sklearn
sklearn.__version__
Load 10 rows from the famous Titanic dataset:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
Basic terminology:
df
We want to use "Parch" and "Fare" as initial features:
Define X and y:
X = df[['Parch', 'Fare']]
X
y = df['Survived']
y
Check the object shapes:
X.shape
y.shape
Create a model object:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=1)
Evaluate the model using cross-validation:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()
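To see what the mean above is averaging, here is a minimal sketch that prints the individual fold scores. It uses a small made-up DataFrame (`X_toy`, `y_toy` are hypothetical names and values, not the Titanic file) so that it runs without a download:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy data (not the Titanic file): 9 rows, 2 numeric features
X_toy = pd.DataFrame({
    'Parch': [0, 0, 1, 2, 0, 1, 0, 2, 1],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.07, 11.13],
})
y_toy = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1])

logreg_toy = LogisticRegression(solver='liblinear', random_state=1)

# cv=3 splits the rows into 3 folds; each fold is held out once for scoring
scores = cross_val_score(logreg_toy, X_toy, y_toy, cv=3, scoring='accuracy')
print(scores)         # one accuracy value per fold
print(scores.mean())  # the average of the three fold scores
```

With only a handful of rows per fold, the individual scores can vary a lot, which is worth keeping in mind when reading a cross-validated score on 10 rows.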
Train the model on the entire dataset:
logreg.fit(X, y)
Read in a new dataset for which we don't know the target values:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
df_new
Define X_new to have the same columns as X:
X_new = df_new[['Parch', 'Fare']]
X_new
Use the trained model to make predictions for X_new:
logreg.predict(X_new)
We want to use "Embarked" and "Sex" as additional features:
df
Encode "Embarked" using one-hot encoding:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(df[['Embarked']])
Ask for a dense (not sparse) matrix so that we can examine the encoding:
ohe = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2, use sparse=False instead
ohe.fit_transform(df[['Embarked']])
ohe.categories_
What's the difference between "fit" and "transform"?
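One way to look at this question is to call the two methods separately. The sketch below assumes two small made-up frames (`train` and `new` are hypothetical, not part of the workflow above): "fit" learns the categories from the data, and "transform" uses what was learned to produce the encoded matrix:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames standing in for training data and new data
train = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})
new = pd.DataFrame({'Embarked': ['Q', 'S']})

ohe_demo = OneHotEncoder()

# fit: learn the categories from the training data (no encoded data is produced)
ohe_demo.fit(train)
print(ohe_demo.categories_)

# transform: apply the categories learned during fit to create the encoded matrix
encoded_new = ohe_demo.transform(new).toarray()
print(encoded_new)
```

Calling fit_transform on the training data does both steps at once. For new data, only transform is called, so the column layout matches what the encoder learned from the training data.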
Encode "Embarked" and "Sex" at the same time:
ohe.fit_transform(df[['Embarked', 'Sex']])
ohe.categories_
How could we include "Embarked" and "Sex" in the model along with "Parch" and "Fare"?
Goals:
Create a list of columns and use that to update X:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']
X = df[cols]
X
Create an instance of OneHotEncoder with the default options:
ohe = OneHotEncoder()
Create a ColumnTransformer:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
(ohe, ['Embarked', 'Sex']),
remainder='passthrough')
Perform the transformation:
ct.fit_transform(X)
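To make the output layout easier to read, here is a small sketch on made-up data (`X_toy` is hypothetical). It passes `sparse_threshold=0`, which is an addition for illustration only, so that the result is always a dense array:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the same four columns as X
X_toy = pd.DataFrame({
    'Parch': [0, 0, 1, 2],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
})

ct_toy = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    remainder='passthrough',
    sparse_threshold=0)  # assumption for this sketch: always return a dense array

out = ct_toy.fit_transform(X_toy)
print(out.shape)  # 3 Embarked columns + 2 Sex columns + Parch + Fare
print(out[0])     # encoded columns come first, passthrough columns come last
```

The encoded columns appear first, in the order the transformers were listed, and the passthrough columns are appended at the end.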
Use Pipeline to chain together sequential steps:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ct, logreg)
Fit the Pipeline:
pipe.fit(X, y)
This is what happens "under the hood" when you fit the Pipeline:
logreg.fit(ct.fit_transform(X), y)
You can select the steps of a Pipeline by name in order to inspect them:
pipe.named_steps.logisticregression.coef_
Update X_new to have the same columns as X:
X_new = df_new[cols]
X_new
Use the fitted Pipeline to make predictions for X_new:
pipe.predict(X_new)
This is what happens "under the hood" when you make predictions using the Pipeline:
logreg.predict(ct.transform(X_new))
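As a rough check of the two "under the hood" lines, the sketch below fits a Pipeline and a manual two-step version on the same made-up data and compares their predictions. All names ending in `_toy` or `_manual` and all data values are hypothetical; `clone` is used so the two routes do not share fitted objects:

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data and new data
X_toy = pd.DataFrame({
    'Parch': [0, 0, 1, 2, 0, 1],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
    'Embarked': ['S', 'C', 'Q', 'S', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
})
y_toy = pd.Series([0, 1, 1, 1, 0, 0])
X_new_toy = pd.DataFrame({
    'Parch': [0, 1],
    'Fare': [7.83, 7.00],
    'Embarked': ['Q', 'S'],
    'Sex': ['male', 'female'],
})

ct_toy = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    remainder='passthrough')
logreg_toy = LogisticRegression(solver='liblinear', random_state=1)

# Route 1: let the Pipeline handle transformation and modeling
pipe_toy = make_pipeline(clone(ct_toy), clone(logreg_toy))
pipe_toy.fit(X_toy, y_toy)
pipe_pred = pipe_toy.predict(X_new_toy)

# Route 2: do the same steps by hand on separate, unfitted copies
ct_manual = clone(ct_toy)
logreg_manual = clone(logreg_toy)
logreg_manual.fit(ct_manual.fit_transform(X_toy), y_toy)
manual_pred = logreg_manual.predict(ct_manual.transform(X_new_toy))

# Compare the predictions from the two routes
print(np.array_equal(pipe_pred, manual_pred))
```

Note that the manual route calls fit_transform on the training data but only transform on the new data, which is the same thing the Pipeline does.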
This is all of the code that is necessary to recreate our workflow up to this point:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
cols = ['Parch', 'Fare', 'Embarked', 'Sex']
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]
ohe = OneHotEncoder()
ct = make_column_transformer(
(ohe, ['Embarked', 'Sex']),
remainder='passthrough')
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
Summary of our ColumnTransformer:
Summary of our Pipeline:
Comparing Pipeline and ColumnTransformer:
We want to use "Name" as an additional feature:
df
Use CountVectorizer to convert text into a matrix of token counts:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
dtm = vect.fit_transform(df['Name'])
dtm
Examine the feature names:
print(vect.get_feature_names_out())  # on scikit-learn < 1.0, use get_feature_names() instead
Examine the document-term matrix as a DataFrame:
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
Update X to include the "Name" column:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]
X
Update the ColumnTransformer:
ct = make_column_transformer(
(ohe, ['Embarked', 'Sex']),
(vect, 'Name'),
remainder='passthrough')
Perform the transformation:
ct.fit_transform(X)
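Notice that "Name" is passed as a bare string while the OneHotEncoder columns are passed as a list. A bare string hands the transformer a one-dimensional column, which is what CountVectorizer expects, whereas a list hands over a two-dimensional selection. A small sketch on made-up data (`X_toy` and its values are hypothetical):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one text column, one categorical, one numeric
X_toy = pd.DataFrame({
    'Parch': [0, 0, 1],
    'Sex': ['male', 'female', 'male'],
    'Name': [
        'Braund, Mr. Owen Harris',
        'Heikkinen, Miss. Laina',
        'Allen, Mr. William Henry',
    ],
})

ct_toy = make_column_transformer(
    (OneHotEncoder(), ['Sex']),   # list: 2D selection for OneHotEncoder
    (CountVectorizer(), 'Name'),  # string: 1D selection for CountVectorizer
    remainder='passthrough')

out = ct_toy.fit_transform(X_toy)

# 2 Sex columns + 10 name tokens + 1 passthrough column (Parch)
print(out.shape)
```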
Update the Pipeline to contain the modified ColumnTransformer:
pipe = make_pipeline(ct, logreg)
Fit the Pipeline and examine the steps:
pipe.fit(X, y)
pipe.named_steps
Update X_new to include the "Name" column:
X_new = df_new[cols]
Use the fitted Pipeline to make predictions for X_new:
pipe.predict(X_new)