Working a Text-Based Data Science Problem

Agenda

  1. Reading in and exploring the data
  2. Feature engineering
  3. Model evaluation using train_test_split and cross_val_score
  4. Making predictions for new data
  5. Searching for optimal tuning parameters using GridSearchCV
  6. Extracting features from text using CountVectorizer
  7. Chaining steps into a Pipeline

Part 1: Reading in and exploring the data

In [1]:
import pandas as pd
train = pd.read_json('../data/train.json')
train.head()
Out[1]:
cuisine id ingredients
0 greek 10259 [romaine lettuce, black olives, grape tomatoes...
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g...
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g...
3 indian 22213 [water, vegetable oil, wheat, salt]
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe...
In [2]:
train.shape
Out[2]:
(39774, 3)
In [3]:
# count the number of null values in each column
train.isnull().sum()
Out[3]:
cuisine        0
id             0
ingredients    0
dtype: int64
In [4]:
train.dtypes
Out[4]:
cuisine        object
id              int64
ingredients    object
dtype: object
In [5]:
# select row 0, column 'ingredients'
train.loc[0, 'ingredients']
Out[5]:
['romaine lettuce',
 'black olives',
 'grape tomatoes',
 'garlic',
 'pepper',
 'purple onion',
 'seasoning',
 'garbanzo beans',
 'feta cheese crumbles']
In [6]:
# ingredients are stored as a list of strings, not as a string
type(train.loc[0, 'ingredients'])
Out[6]:
list
In [7]:
# examine the class distribution
train.cuisine.value_counts()
Out[7]:
italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

Part 2: Feature engineering

  • Feature engineering is the process through which you create features that don't natively exist in the dataset.
  • Your goal is to create features that capture the signal in the data (with respect to the response value), rather than the noise.

Example: Number of ingredients

In [8]:
# count the number of ingredients in each recipe
train['num_ingredients'] = train.ingredients.apply(len)
train.head()
Out[8]:
cuisine id ingredients num_ingredients
0 greek 10259 [romaine lettuce, black olives, grape tomatoes... 9
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g... 11
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g... 12
3 indian 22213 [water, vegetable oil, wheat, salt] 4
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe... 20
In [9]:
# for each cuisine, calculate the mean number of ingredients
train.groupby('cuisine').num_ingredients.mean()
Out[9]:
cuisine
brazilian        9.520343
british          9.708955
cajun_creole    12.617076
chinese         11.982791
filipino        10.000000
french           9.817838
greek           10.182128
indian          12.705961
irish            9.299850
italian          9.909033
jamaican        12.214829
japanese         9.735067
korean          11.284337
mexican         10.877446
moroccan        12.909866
russian         10.224949
southern_us      9.634954
spanish         10.423660
thai            12.545809
vietnamese      12.675152
Name: num_ingredients, dtype: float64
In [10]:
# for each cuisine, "describe" the number of ingredients (and unstack into a DataFrame)
train.groupby('cuisine').num_ingredients.describe().unstack()
Out[10]:
       cuisine     
count  brazilian        467.000000
       british          804.000000
       cajun_creole    1546.000000
       chinese         2673.000000
       filipino         755.000000
       french          2646.000000
       greek           1175.000000
       indian          3003.000000
       irish            667.000000
       italian         7838.000000
       jamaican         526.000000
       japanese        1423.000000
       korean           830.000000
       mexican         6438.000000
       moroccan         821.000000
       russian          489.000000
       southern_us     4320.000000
       spanish          989.000000
       thai            1539.000000
       vietnamese       825.000000
mean   brazilian          9.520343
       british            9.708955
       cajun_creole      12.617076
       chinese           11.982791
       filipino          10.000000
       french             9.817838
       greek             10.182128
       indian            12.705961
       irish              9.299850
       italian            9.909033
                          ...     
75%    jamaican          15.000000
       japanese          12.000000
       korean            14.000000
       mexican           14.000000
       moroccan          16.000000
       russian           13.000000
       southern_us       12.000000
       spanish           13.000000
       thai              15.000000
       vietnamese        16.000000
max    brazilian         59.000000
       british           30.000000
       cajun_creole      31.000000
       chinese           38.000000
       filipino          38.000000
       french            31.000000
       greek             27.000000
       indian            49.000000
       irish             27.000000
       italian           65.000000
       jamaican          35.000000
       japanese          34.000000
       korean            29.000000
       mexican           52.000000
       moroccan          31.000000
       russian           25.000000
       southern_us       40.000000
       spanish           35.000000
       thai              40.000000
       vietnamese        31.000000
Length: 160, dtype: float64
In [11]:
# allow plots to appear in the notebook
%matplotlib inline
In [12]:
# box plot of the number of ingredients for each cuisine
train.boxplot('num_ingredients', by='cuisine')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x11fb66a58>

Example: Mean length of ingredient names

In [13]:
sample_recipe = train.loc[3, 'ingredients']
print(sample_recipe)
['water', 'vegetable oil', 'wheat', 'salt']
In [14]:
import numpy as np
In [15]:
# define a function that calculates the mean string length from a list of strings
def mean_string_length(list_of_strings):
    return np.mean([len(string) for string in list_of_strings])
In [16]:
mean_string_length(sample_recipe)
Out[16]:
6.75
In [17]:
# calculate the mean ingredient length for each recipe (two different ways)
train['ingredient_length'] = train.ingredients.apply(mean_string_length)
train['ingredient_length'] = train.ingredients.apply(lambda x: np.mean([len(item) for item in x]))
train.head()
Out[17]:
cuisine id ingredients num_ingredients ingredient_length
0 greek 10259 [romaine lettuce, black olives, grape tomatoes... 9 12.000000
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g... 11 10.090909
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g... 12 10.333333
3 indian 22213 [water, vegetable oil, wheat, salt] 4 6.750000
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe... 20 10.100000
In [18]:
# box plot of mean ingredient length for each cuisine
train.boxplot('ingredient_length', by='cuisine')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x11fe48a90>
In [19]:
# define a function that accepts a DataFrame and adds new features
def make_features(df):
    df['num_ingredients'] = df.ingredients.apply(len)
    df['ingredient_length'] = df.ingredients.apply(lambda x: np.mean([len(item) for item in x]))
    return df
In [20]:
# check that the function works
train = make_features(pd.read_json('../data/train.json'))
train.head()
Out[20]:
cuisine id ingredients num_ingredients ingredient_length
0 greek 10259 [romaine lettuce, black olives, grape tomatoes... 9 12.000000
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g... 11 10.090909
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g... 12 10.333333
3 indian 22213 [water, vegetable oil, wheat, salt] 4 6.750000
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe... 20 10.100000

Part 3: Model evaluation using train_test_split and cross_val_score

  • The motivation for model evaluation is that you need a way to choose between models (different model types, tuning parameters, and features).
  • You use a model evaluation procedure to estimate how well a model will generalize to out-of-sample data.
  • This requires a model evaluation metric to quantify a model's performance.
In [21]:
# define X and y
feature_cols = ['num_ingredients', 'ingredient_length']
X = train[feature_cols]
y = train.cuisine
In [22]:
print(X.shape)
print(y.shape)
(39774, 2)
(39774,)
In [23]:
# note: response values are strings (not numbers)
y.values
Out[23]:
array(['greek', 'southern_us', 'filipino', ..., 'irish', 'chinese',
       'mexican'], dtype=object)
In [24]:
# use KNN with K=100
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=100)

Train/test split

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [26]:
# make class predictions for the testing set
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
In [27]:
# check the classification accuracy of KNN's predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
Out[27]:
0.21932823813354788

K-fold cross-validation

  • Train/test split is faster and more flexible than cross-validation.
  • Cross-validation provides a more accurate estimate of out-of-sample performance (a sketch of what cross_val_score roughly does under the hood follows this list).
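A minimal sketch of that procedure, assuming scikit-learn's stratified splitting for classifiers (which cross_val_score uses by default); the fold loop and variable names below are illustrative only:

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
import numpy as np

# split the rows into 5 stratified folds; train on 4 folds and score on the held-out fold
kf = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in kf.split(X, y):
    model = clone(knn)                               # fresh, unfitted copy of the estimator
    model.fit(X.iloc[train_idx], y.iloc[train_idx])  # fit on the training folds
    fold_scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))  # accuracy on the held-out fold
np.mean(fold_scores)                                 # average the 5 accuracy scores

cross_val_score does this (plus the scoring bookkeeping) in a single call, as in the next cell.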
In [28]:
# evaluate with 5-fold cross-validation (using X instead of X_train)
from sklearn.model_selection import cross_val_score
cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()
Out[28]:
0.21591924749538957

Null model

  • For classification problems, the null model always predicts the most frequent class from the training data.
  • For regression problems, the null model always predicts the mean of the response value from the training data.
  • The null model provides a useful baseline against which to measure your own model (a sketch of the regression case follows this list).
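The regression case is not demonstrated in this notebook; as a minimal sketch (y_reg_train below is a hypothetical numeric response), scikit-learn's DummyRegressor plays the same role:

from sklearn.dummy import DummyRegressor

# the null model for regression ignores the features and always predicts the training mean
dumb_reg = DummyRegressor(strategy='mean')
# dumb_reg.fit(X_train, y_reg_train)     # y_reg_train is a hypothetical numeric response
# dumb_reg.predict(X_test)               # an array filled with the mean of y_reg_train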
In [29]:
# calculate the null accuracy
y_test.value_counts().head(1) / y_test.shape
Out[29]:
italian    0.199216
Name: cuisine, dtype: float64
In [30]:
# use DummyClassifier instead
from sklearn.dummy import DummyClassifier
dumb = DummyClassifier(strategy='most_frequent')
dumb.fit(X_train, y_train)
y_pred_class = dumb.predict(X_test)
metrics.accuracy_score(y_test, y_pred_class)
Out[30]:
0.1992156074014481

Part 4: Making predictions for new data

In [31]:
# read in test.json and add the additional features
new = make_features(pd.read_json('../data/test.json'))
new.head()
Out[31]:
id ingredients num_ingredients ingredient_length
0 18009 [baking powder, eggs, all-purpose flour, raisi... 6 9.333333
1 28583 [sugar, egg yolks, corn starch, cream of tarta... 11 10.272727
2 41580 [sausage links, fennel bulb, fronds, olive oil... 6 9.666667
3 29752 [meat cuts, file powder, smoked sausage, okra,... 21 12.000000
4 35687 [ground black pepper, salt, sausage casings, l... 8 13.000000
In [32]:
new.shape
Out[32]:
(9944, 4)
In [33]:
# create a DataFrame of the relevant columns from the new data
X_new = new[feature_cols]
X_new.head()
Out[33]:
num_ingredients ingredient_length
0 6 9.333333
1 11 10.272727
2 6 9.666667
3 21 12.000000
4 8 13.000000
In [34]:
X_new.shape
Out[34]:
(9944, 2)
In [35]:
# train KNN on ALL of the training data (using X instead of X_train)
knn.fit(X, y)
Out[35]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=100, p=2,
           weights='uniform')
In [36]:
# make class predictions for the new data
new_pred_class_knn = knn.predict(X_new)
new_pred_class_knn
Out[36]:
array(['mexican', 'southern_us', 'mexican', ..., 'italian', 'mexican',
       'mexican'], dtype=object)
In [37]:
new_pred_class_knn.shape
Out[37]:
(9944,)
In [38]:
# create a DataFrame that only contains the IDs and predicted classes for the new data
pd.DataFrame({'id':new.id, 'cuisine':new_pred_class_knn}).set_index('id').head()
Out[38]:
cuisine
id
18009 mexican
28583 southern_us
41580 mexican
29752 mexican
35687 italian
In [39]:
# create a submission file from that DataFrame (score: 0.21742)
pd.DataFrame({'id':new.id, 'cuisine':new_pred_class_knn}).set_index('id').to_csv('sub1.csv')

Part 5: Searching for optimal tuning parameters using GridSearchCV

In [40]:
# reminder of the cross-validated accuracy of KNN with K=100
knn = KNeighborsClassifier(n_neighbors=100)
cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()
Out[40]:
0.21591924749538957
In [41]:
from sklearn.model_selection import GridSearchCV
In [42]:
# define a "parameter grid" in which the key is the parameter and the value is a list of options to try
param_grid = {}
param_grid['n_neighbors'] = [100, 200]
param_grid
Out[42]:
{'n_neighbors': [100, 200]}
In [43]:
# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
In [44]:
# run the grid search
grid.fit(X, y)
Out[44]:
GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=100, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [100, 200]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring='accuracy',
       verbose=0)
In [45]:
# examine the scores for each parameter option
grid.grid_scores_
/Users/georgioskarakostas/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py:762: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[45]:
[mean: 0.21592, std: 0.00172, params: {'n_neighbors': 100},
 mean: 0.21949, std: 0.00181, params: {'n_neighbors': 200}]
In [46]:
# try K=200 to 1000 (by 200)
param_grid = {}
param_grid['n_neighbors'] = list(range(200, 1001, 200))
param_grid
Out[46]:
{'n_neighbors': [200, 400, 600, 800, 1000]}
In [47]:
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
In [48]:
# time the grid search using an IPython "magic function"
%time grid.fit(X, y)
CPU times: user 3min 30s, sys: 25 s, total: 3min 55s
Wall time: 3min 5s
Out[48]:
GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=100, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [200, 400, 600, 800, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)
In [49]:
# examine the scores for each parameter option
grid.grid_scores_
Out[49]:
[mean: 0.21949, std: 0.00181, params: {'n_neighbors': 200},
 mean: 0.21994, std: 0.00331, params: {'n_neighbors': 400},
 mean: 0.22213, std: 0.00154, params: {'n_neighbors': 600},
 mean: 0.22296, std: 0.00191, params: {'n_neighbors': 800},
 mean: 0.22193, std: 0.00169, params: {'n_neighbors': 1000}]
In [50]:
# extract only the mean scores
grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]
grid_mean_scores
Out[50]:
[0.2194901191733293,
 0.21994267612007845,
 0.2221300346960326,
 0.22295972243173934,
 0.2219288982752552]
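As the deprecation warning above notes, grid_scores_ was replaced by cv_results_; a minimal sketch of the equivalent lookups (assuming scikit-learn 0.18 or later):

# cv_results_ is a dict of arrays; these keys summarize each parameter option
grid.cv_results_['params']            # [{'n_neighbors': 200}, ..., {'n_neighbors': 1000}]
grid.cv_results_['mean_test_score']   # mean cross-validated accuracy for each option
grid.cv_results_['std_test_score']    # standard deviation of the accuracy across folds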
In [51]:
# line plot of K value (x-axis) versus accuracy (y-axis)
import matplotlib.pyplot as plt
plt.plot(list(range(200, 1001, 200)), grid_mean_scores)
Out[51]:
[<matplotlib.lines.Line2D at 0x1190bfc88>]
In [52]:
# print the single best score and parameters that produced that score
print(grid.best_score_)
print(grid.best_params_)
0.22295972243173934
{'n_neighbors': 800}
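Because GridSearchCV uses refit=True by default, the fitted grid object has already retrained the best model (K=800) on all of the data, so it can make predictions directly; a minimal sketch using X_new from Part 4:

# the best model, refit on all of X and y
grid.best_estimator_
# class predictions for the new data from that best model
grid.predict(X_new)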

Part 6: Extracting features from text using CountVectorizer

In [53]:
# reminder: ingredients are stored as a list of strings, not as a string
train.loc[0, 'ingredients']
Out[53]:
['romaine lettuce',
 'black olives',
 'grape tomatoes',
 'garlic',
 'pepper',
 'purple onion',
 'seasoning',
 'garbanzo beans',
 'feta cheese crumbles']
In [54]:
# convert each list of ingredients into a string
train.ingredients.astype(str)[0]
Out[54]:
"['romaine lettuce', 'black olives', 'grape tomatoes', 'garlic', 'pepper', 'purple onion', 'seasoning', 'garbanzo beans', 'feta cheese crumbles']"
In [55]:
# update make_features to create a new column 'ingredients_str'
def make_features(df):
    df['num_ingredients'] = df.ingredients.apply(len)
    df['ingredient_length'] = df.ingredients.apply(lambda x: np.mean([len(item) for item in x]))
    df['ingredients_str'] = df.ingredients.astype(str)
    return df
In [56]:
# run make_features and check that it worked
train = make_features(pd.read_json('../data/train.json'))
train.loc[0, 'ingredients_str']
Out[56]:
"['romaine lettuce', 'black olives', 'grape tomatoes', 'garlic', 'pepper', 'purple onion', 'seasoning', 'garbanzo beans', 'feta cheese crumbles']"
In [57]:
# define X and y
X = train.ingredients_str
y = train.cuisine
In [58]:
# import and instantiate CountVectorizer (with default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect
Out[58]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
In [59]:
# create a document-term matrix from all of the training data
X_dtm = vect.fit_transform(X)
X_dtm.shape
Out[59]:
(39774, 3010)
In [60]:
# examine the features that were created
print(vect.get_feature_names()[0:100])
['00', '10', '100', '14', '15', '25', '33', '40', '43', '95', '96', 'abalone', 'abbamele', 'absinthe', 'abura', 'acai', 'accent', 'accompaniment', 'achiote', 'acid', 'acini', 'ackee', 'acorn', 'acting', 'activ', 'active', 'added', 'adobo', 'adzuki', 'agar', 'agave', 'age', 'aged', 'ahi', 'aioli', 'ajinomoto', 'ajwain', 'aka', 'alaskan', 'albacore', 'alcohol', 'ale', 'aleppo', 'alexia', 'alfalfa', 'alfredo', 'all', 'allspice', 'almond', 'almondmilk', 'almonds', 'aloe', 'alphabet', 'alum', 'amaranth', 'amarena', 'amaretti', 'amaretto', 'amba', 'amber', 'amberjack', 'amchur', 'america', 'american', 'aminos', 'ammonium', 'amontillado', 'ampalaya', 'an', 'anaheim', 'anasazi', 'ancho', 'anchovies', 'anchovy', 'and', 'andouille', 'anejo', 'angel', 'anglaise', 'angled', 'angostura', 'angus', 'anise', 'anisette', 'anjou', 'annatto', 'any', 'aonori', 'apple', 'apples', 'applesauce', 'applewood', 'apricot', 'apricots', 'aquavit', 'arak', 'arame', 'arbol', 'arborio', 'arctic']
In [61]:
# replace the default regex pattern used for tokenization, so that each quoted ingredient becomes a single token
vect = CountVectorizer(token_pattern=r"'([a-z ]+)'")
X_dtm = vect.fit_transform(X)
X_dtm.shape
Out[61]:
(39774, 6250)
In [62]:
# examine the features that were created
print(vect.get_feature_names()[0:100])
['a taste of thai rice noodles', 'abalone', 'abbamele', 'absinthe', 'abura age', 'acai juice', 'accent', 'accent seasoning', 'accompaniment', 'achiote', 'achiote paste', 'achiote powder', 'acini di pepe', 'ackee', 'acorn squash', 'active dry yeast', 'adobo', 'adobo all purpose seasoning', 'adobo sauce', 'adobo seasoning', 'adobo style seasoning', 'adzuki beans', 'agar', 'agar agar flakes', 'agave nectar', 'agave tequila', 'aged balsamic vinegar', 'aged cheddar cheese', 'aged gouda', 'aged manchego cheese', 'ahi', 'ahi tuna steaks', 'aioli', 'ajinomoto', 'ajwain', 'aka miso', 'alaskan king crab legs', 'alaskan king salmon', 'albacore', 'albacore tuna in water', 'alcohol', 'ale', 'aleppo', 'aleppo pepper', 'alexia waffle fries', 'alfalfa sprouts', 'alfredo sauce', 'alfredo sauce mix', 'all beef hot dogs', 'all potato purpos', 'all purpose seasoning', 'all purpose unbleached flour', 'allspice', 'allspice berries', 'almond butter', 'almond extract', 'almond filling', 'almond flour', 'almond liqueur', 'almond meal', 'almond milk', 'almond oil', 'almond paste', 'almond syrup', 'almonds', 'aloe juice', 'alphabet pasta', 'alum', 'amaranth', 'amarena cherries', 'amaretti', 'amaretti cookies', 'amaretto', 'amaretto liqueur', 'amba', 'amber', 'amber rum', 'amberjack fillet', 'amchur', 'america', 'american cheese', 'american cheese food', 'american cheese slices', 'ammonium bicarbonate', 'amontillado sherry', 'ampalaya', 'anaheim chile', 'anasazi beans', 'ancho', 'ancho chile pepper', 'ancho chili ground pepper', 'ancho powder', 'anchovies', 'anchovy filets', 'anchovy fillets', 'anchovy paste', 'and carrot green pea', 'and cook drain pasta ziti', 'and fat free half half', 'andouille chicken sausage']
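To see what the custom token_pattern captures, here is a minimal sketch using Python's re module on a shortened version of the recipe string shown earlier:

import re

# the pattern keeps each quoted ingredient (lowercase letters and spaces) as one token
sample = "['romaine lettuce', 'black olives', 'garbanzo beans', 'feta cheese crumbles']"
re.findall(r"'([a-z ]+)'", sample)
# ['romaine lettuce', 'black olives', 'garbanzo beans', 'feta cheese crumbles']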
In [63]:
# import and instantiate Multinomial Naive Bayes (with the default parameters)
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
In [64]:
# slightly improper cross-validation
cross_val_score(nb, X_dtm, y, cv=5, scoring='accuracy').mean()
Out[64]:
0.7301267156198039

Why is this improper cross-validation?

  • Normally, we split the data into training and testing sets before creating the document-term matrix. But since cross_val_score does the splitting for you, we passed it the feature matrix (X_dtm), which was built from all of the data, so the vocabulary of every "testing" fold had already been seen when the vectorizer was fit.
  • That does not appropriately simulate the real world, in which your out-of-sample data will contain features (ingredients) that were not seen during model training; the sketch below shows the leakage-free handling for a single split.
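A minimal sketch: fit the vectorizer on the training text only, then transform the testing text, so that ingredients unseen during training are simply ignored (the variable names below are for illustration):

from sklearn.model_selection import train_test_split

X_train_text, X_test_text, y_train_s, y_test_s = train_test_split(X, y, random_state=1)
train_dtm = vect.fit_transform(X_train_text)   # vocabulary is learned from the training text only
test_dtm = vect.transform(X_test_text)         # testing text is mapped onto that vocabulary
nb.fit(train_dtm, y_train_s)
nb.score(test_dtm, y_test_s)                   # accuracy without vocabulary leakage

A Pipeline (Part 7) automates exactly this pattern inside every fold of cross-validation.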

What's the solution?

  • We need a way to pass X (not X_dtm) to cross_val_score, and have the feature creation (via CountVectorizer) occur within each fold of cross-validation.
  • We will do this by using a Pipeline.

Part 7: Chaining steps into a Pipeline

In [65]:
# examine the numeric columns of the training data
train.describe()
Out[65]:
id num_ingredients ingredient_length
count 39774.000000 39774.000000 39774.000000
mean 24849.536959 10.767713 11.733187
std 14360.035505 4.428978 2.364183
min 0.000000 1.000000 4.000000
25% 12398.250000 8.000000 10.200000
50% 24887.000000 10.000000 11.625000
75% 37328.500000 13.000000 13.117647
max 49717.000000 65.000000 31.400000
In [66]:
# treat the value 1 (which only occurs in num_ingredients) as missing and impute a replacement using the median
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values=1, strategy='median')
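Note that Imputer was later deprecated; in newer scikit-learn releases (a version assumption, roughly 0.20 and above) the drop-in replacement lives in sklearn.impute:

# drop-in replacement for Imputer in newer scikit-learn versions
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=1, strategy='median')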
In [67]:
# create a pipeline of missing value imputation and KNN
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, knn)
In [68]:
# examine the pipeline steps
pipe.steps
Out[68]:
[('imputer',
  Imputer(axis=0, copy=True, missing_values=1, strategy='median', verbose=0)),
 ('kneighborsclassifier',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=1, n_neighbors=100, p=2,
             weights='uniform'))]
In [69]:
# alternative method for creating the identical pipeline
from sklearn.pipeline import Pipeline
pipe = Pipeline([('imputer', imp), ('kneighborsclassifier', knn)])
In [70]:
# fit and predict using the entire pipeline
pipe.fit(X_train, y_train)
y_pred_class = pipe.predict(X_test)
metrics.accuracy_score(y_test, y_pred_class)
Out[70]:
0.22043443282381336

Using a Pipeline for proper cross-validation

In [71]:
# create a pipeline of vectorization and Naive Bayes
pipe = make_pipeline(vect, nb)
pipe.steps
Out[71]:
[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None, stop_words=None,
          strip_accents=None, token_pattern="'([a-z ]+)'", tokenizer=None,
          vocabulary=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]

We can now pass X (instead of X_dtm) to cross_val_score, and the vectorization will occur within each fold of cross-validation.

In [72]:
# X is just a Series of strings
X.head()
Out[72]:
0    ['romaine lettuce', 'black olives', 'grape tom...
1    ['plain flour', 'ground pepper', 'salt', 'toma...
2    ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
3          ['water', 'vegetable oil', 'wheat', 'salt']
4    ['black pepper', 'shallots', 'cornflour', 'cay...
Name: ingredients_str, dtype: object
In [73]:
# cross-validate the entire pipeline
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
Out[73]:
0.7322884933790151

Making predictions using a Pipeline

In [74]:
# fit the pipeline (rather than just the model)
pipe.fit(X, y)
Out[74]:
Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern="'([a-z ]+)'", tokenizer=None,
        vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
In [75]:
# read in test.json and add the additional features
new = make_features(pd.read_json('../data/test.json'))
In [76]:
# define X_new as a Series of strings
X_new = new.ingredients_str
In [77]:
# use the pipeline to make predictions for the new data
new_pred_class_pipe = pipe.predict(X_new)
In [78]:
# create a submission file (score: 0.73663)
pd.DataFrame({'id':new.id, 'cuisine':new_pred_class_pipe}).set_index('id').to_csv('sub2.csv')