Advanced Machine Learning Techniques

Agenda

  1. Reading in the Kaggle data and adding features
  2. Using a Pipeline for proper cross-validation
  3. Combining GridSearchCV with Pipeline
  4. Efficiently searching for tuning parameters using RandomizedSearchCV
  5. Adding features to a document-term matrix (using SciPy)
  6. Adding features to a document-term matrix (using FeatureUnion)
  7. Ensembling models
  8. Locating groups of similar cuisines
  9. Model stacking

Part 1: Reading in the Kaggle data and adding features

  • Our goal is to predict the cuisine of a recipe, given its ingredients.
  • Feature engineering is the process through which you create features that don't natively exist in the dataset.
In [1]:
import pandas as pd
import numpy as np
In [2]:
# define a function that accepts a DataFrame and adds new features
def make_features(df):
    
    # number of ingredients
    df['num_ingredients'] = df.ingredients.apply(len)
    
    # mean length of ingredient names
    df['ingredient_length'] = df.ingredients.apply(lambda x: np.mean([len(item) for item in x]))
    
    # string representation of the ingredient list
    df['ingredients_str'] = df.ingredients.astype(str)
    
    return df
In [3]:
# create the same features in the training data and the new data
train = make_features(pd.read_json('../data/train.json'))
new = make_features(pd.read_json('../data/test.json'))
In [4]:
train.head()
Out[4]:
cuisine id ingredients num_ingredients ingredient_length ingredients_str
0 greek 10259 [romaine lettuce, black olives, grape tomatoes... 9 12.000000 ['romaine lettuce', 'black olives', 'grape tom...
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g... 11 10.090909 ['plain flour', 'ground pepper', 'salt', 'toma...
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g... 12 10.333333 ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
3 indian 22213 [water, vegetable oil, wheat, salt] 4 6.750000 ['water', 'vegetable oil', 'wheat', 'salt']
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe... 20 10.100000 ['black pepper', 'shallots', 'cornflour', 'cay...
In [5]:
train.shape
Out[5]:
(39774, 6)
In [6]:
new.head()
Out[6]:
id ingredients num_ingredients ingredient_length ingredients_str
0 18009 [baking powder, eggs, all-purpose flour, raisi... 6 9.333333 ['baking powder', 'eggs', 'all-purpose flour',...
1 28583 [sugar, egg yolks, corn starch, cream of tarta... 11 10.272727 ['sugar', 'egg yolks', 'corn starch', 'cream o...
2 41580 [sausage links, fennel bulb, fronds, olive oil... 6 9.666667 ['sausage links', 'fennel bulb', 'fronds', 'ol...
3 29752 [meat cuts, file powder, smoked sausage, okra,... 21 12.000000 ['meat cuts', 'file powder', 'smoked sausage',...
4 35687 [ground black pepper, salt, sausage casings, l... 8 13.000000 ['ground black pepper', 'salt', 'sausage casin...
In [7]:
new.shape
Out[7]:
(9944, 5)

Part 2: Using a Pipeline for proper cross-validation

In [8]:
# define X and y
X = train.ingredients_str
y = train.cuisine
In [9]:
# X is just a Series of strings
X.head()
Out[9]:
0    ['romaine lettuce', 'black olives', 'grape tom...
1    ['plain flour', 'ground pepper', 'salt', 'toma...
2    ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
3          ['water', 'vegetable oil', 'wheat', 'salt']
4    ['black pepper', 'shallots', 'cornflour', 'cay...
Name: ingredients_str, dtype: object
In [10]:
# replace the regex pattern that is used for tokenization
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(token_pattern=r"'([a-z ]+)'")
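Why this pattern? A quick sketch (not part of the original run) of how the default and custom token patterns split a single ingredients string; build_tokenizer exposes the tokenizer that CountVectorizer builds from token_pattern:

# default pattern splits on word boundaries: individual words
CountVectorizer().build_tokenizer()("['plain flour', 'ground pepper', 'salt']")
# -> ['plain', 'flour', 'ground', 'pepper', 'salt']

# custom pattern captures everything between single quotes: whole ingredient names
vect.build_tokenizer()("['plain flour', 'ground pepper', 'salt']")
# -> ['plain flour', 'ground pepper', 'salt']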
In [11]:
# import and instantiate Multinomial Naive Bayes (with the default parameters)
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
In [12]:
# create a pipeline of vectorization and Naive Bayes
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(vect, nb)
In [13]:
# examine the pipeline steps
pipe.steps
Out[13]:
[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None, stop_words=None,
          strip_accents=None, token_pattern="'([a-z ]+)'", tokenizer=None,
          vocabulary=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]

Proper cross-validation:

  • By passing our pipeline to cross_val_score, we ensure that features are created from X (via CountVectorizer) within each fold of cross-validation.
  • This process simulates the real world, in which your out-of-sample data will contain features that were not seen during model training.
In [14]:
# cross-validate the entire pipeline
from sklearn.cross_validation import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
/Users/georgioskarakostas/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Out[14]:
0.7322884933790151
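For contrast, here is a minimal sketch (not run in the original notebook) of the slightly improper alternative: fitting the vectorizer on all of X up front lets vocabulary from the validation folds leak into the training features.

# improper: the vocabulary is learned from every row before any train/test split
X_dtm_all = vect.fit_transform(X)
cross_val_score(nb, X_dtm_all, y, cv=5, scoring='accuracy').mean()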

Part 3: Combining GridSearchCV with Pipeline

  • We use GridSearchCV to locate optimal tuning parameters by performing an "exhaustive grid search" of different parameter combinations, searching for the combination that has the best cross-validated accuracy.
  • By passing a Pipeline to GridSearchCV (instead of just a model), we can search tuning parameters for both the vectorizer and the model.
In [15]:
# pipeline steps are automatically assigned names by make_pipeline
pipe.named_steps.keys()
Out[15]:
dict_keys(['countvectorizer', 'multinomialnb'])
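Grid parameter names combine the step name and the parameter name with a double underscore ("step__parameter"). A quick aside (not in the original run) to list every searchable name:

# every parameter reachable through the pipeline, e.g. 'countvectorizer__token_pattern'
sorted(pipe.get_params().keys())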
In [16]:
# create a grid of parameters to search (and specify the pipeline step along with the parameter)
param_grid = {}
param_grid['countvectorizer__token_pattern'] = [r"\b\w\w+\b", r"'([a-z ]+)'"]
param_grid['multinomialnb__alpha'] = [0.5, 1]
param_grid
Out[16]:
{'countvectorizer__token_pattern': ['\\b\\w\\w+\\b', "'([a-z ]+)'"],
 'multinomialnb__alpha': [0.5, 1]}
In [17]:
# pass the pipeline (instead of the model) to GridSearchCV
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
/Users/georgioskarakostas/anaconda3/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
In [18]:
# time the grid search
%time grid.fit(X, y)
CPU times: user 32.9 s, sys: 878 ms, total: 33.7 s
Wall time: 22.7 s
Out[18]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), p...  vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'countvectorizer__token_pattern': ['\\b\\w\\w+\\b', "'([a-z ]+)'"], 'multinomialnb__alpha': [0.5, 1]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)
In [19]:
# examine the score for each combination of parameters
grid.grid_scores_
Out[19]:
[mean: 0.72422, std: 0.00457, params: {'countvectorizer__token_pattern': '\\b\\w\\w+\\b', 'multinomialnb__alpha': 0.5},
 mean: 0.72351, std: 0.00469, params: {'countvectorizer__token_pattern': '\\b\\w\\w+\\b', 'multinomialnb__alpha': 1},
 mean: 0.74770, std: 0.00460, params: {'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.5},
 mean: 0.73229, std: 0.00552, params: {'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 1}]
In [20]:
# print the single best score and parameters that produced that score
print(grid.best_score_)
print(grid.best_params_)
0.7476995021873586
{'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.5}

Part 4: Efficiently searching for tuning parameters using RandomizedSearchCV

  • When there are many parameters to tune, searching all possible combinations of parameter values may be computationally infeasible.
  • RandomizedSearchCV searches a sample of the parameter values, and you control the computational "budget".

RandomizedSearchCV documentation

In [21]:
from sklearn.grid_search import RandomizedSearchCV
In [22]:
# for any continuous parameters, specify a distribution instead of a list of options
import scipy as sp
param_grid = {}
param_grid['countvectorizer__token_pattern'] = [r"\b\w\w+\b", r"'([a-z ]+)'"]
param_grid['countvectorizer__min_df'] = [1, 2, 3]
param_grid['multinomialnb__alpha'] = sp.stats.uniform(scale=1)
param_grid
Out[22]:
{'countvectorizer__token_pattern': ['\\b\\w\\w+\\b', "'([a-z ]+)'"],
 'countvectorizer__min_df': [1, 2, 3],
 'multinomialnb__alpha': <scipy.stats._distn_infrastructure.rv_frozen at 0x120db5ef0>}
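For intuition, a small sketch (not part of the original run): the randomized search samples values from this distribution via its rvs method, so uniform(scale=1) means alpha is drawn uniformly from the interval [0, 1].

# draw three candidate alpha values, roughly the way RandomizedSearchCV samples them
param_grid['multinomialnb__alpha'].rvs(3)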
In [23]:
# set a random seed for sp.stats.uniform
np.random.seed(1)
In [24]:
# additional parameters are n_iter (number of searches) and random_state
rand = RandomizedSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_iter=5, random_state=1)
In [25]:
# time the randomized search
%time rand.fit(X, y)
CPU times: user 40.2 s, sys: 1.08 s, total: 41.2 s
Wall time: 27.5 s
Out[25]:
RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), p...  vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
          fit_params={}, iid=True, n_iter=5, n_jobs=1,
          param_distributions={'countvectorizer__token_pattern': ['\\b\\w\\w+\\b', "'([a-z ]+)'"], 'countvectorizer__min_df': [1, 2, 3], 'multinomialnb__alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x120db5ef0>},
          pre_dispatch='2*n_jobs', random_state=1, refit=True,
          scoring='accuracy', verbose=0)
In [26]:
rand.grid_scores_
Out[26]:
[mean: 0.74986, std: 0.00494, params: {'countvectorizer__min_df': 2, 'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.417022004702574},
 mean: 0.72434, std: 0.00444, params: {'countvectorizer__min_df': 1, 'countvectorizer__token_pattern': '\\b\\w\\w+\\b', 'multinomialnb__alpha': 0.7203244934421581},
 mean: 0.72829, std: 0.00537, params: {'countvectorizer__min_df': 2, 'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.00011437481734488664},
 mean: 0.75137, std: 0.00438, params: {'countvectorizer__min_df': 2, 'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.30233257263183977},
 mean: 0.72218, std: 0.00438, params: {'countvectorizer__min_df': 1, 'countvectorizer__token_pattern': '\\b\\w\\w+\\b', 'multinomialnb__alpha': 0.14675589081711304}]
In [27]:
print(rand.best_score_)
print(rand.best_params_)
0.751370241866546
{'countvectorizer__min_df': 2, 'countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.30233257263183977}

Making predictions for new data

In [28]:
# define X_new as the ingredient text
X_new = new.ingredients_str
In [29]:
# print the best model found by RandomizedSearchCV
rand.best_estimator_
Out[29]:
Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern="'([a-z ]+)'", tokenizer=None,
        vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=0.30233257263183977, class_prior=None, fit_prior=True))])
In [30]:
# RandomizedSearchCV and GridSearchCV automatically refit the best model on the entire dataset, so they can be used to make predictions
new_pred_class_rand = rand.predict(X_new)
new_pred_class_rand
Out[30]:
array(['british', 'southern_us', 'italian', ..., 'italian', 'southern_us',
       'mexican'], dtype='<U12')
In [31]:
# create a submission file (score: 0.75342)
pd.DataFrame({'id':new.id, 'cuisine':new_pred_class_rand}).set_index('id').to_csv('sub3.csv')

Part 5: Adding features to a document-term matrix (using SciPy)

  • So far, we've trained models on either the document-term matrix or the manually created features, but not both.
  • To train a model on both types of features, we need to combine them into a single feature matrix.
  • Because one of the matrices is sparse and the other is dense, the easiest way to combine them is by using SciPy.
In [32]:
# create a document-term matrix from all of the training data
X_dtm = vect.fit_transform(X)
X_dtm.shape
Out[32]:
(39774, 6250)
In [33]:
type(X_dtm)
Out[33]:
scipy.sparse.csr.csr_matrix
In [34]:
# create a DataFrame of the manually created features
X_manual = train.loc[:, ['num_ingredients', 'ingredient_length']]
X_manual.shape
Out[34]:
(39774, 2)
In [35]:
# create a sparse matrix from the DataFrame
X_manual_sparse = sp.sparse.csr_matrix(X_manual)
type(X_manual_sparse)
Out[35]:
scipy.sparse.csr.csr_matrix
In [36]:
# combine the two sparse matrices
X_dtm_manual = sp.sparse.hstack([X_dtm, X_manual_sparse])
X_dtm_manual.shape
Out[36]:
(39774, 6252)
  • This was a relatively easy process.
  • However, it does not allow us to do proper cross-validation, and it doesn't integrate well with the rest of the scikit-learn workflow.

Part 6: Adding features to a document-term matrix (using FeatureUnion)

  • Below is an alternative process that does allow for proper cross-validation, and does integrate well with the scikit-learn workflow.
  • To use this process, we have to learn about transformers, FunctionTransformer, and FeatureUnion.

What are "transformers"?

Transformer objects provide a transform method that performs a data transformation. Here are a few examples (a minimal fit/transform sketch follows the list):

  • CountVectorizer
    • fit learns the vocabulary
    • transform creates a document-term matrix using the vocabulary
  • Imputer
    • fit learns the value to impute
    • transform fills in missing entries using the imputation value
  • StandardScaler
    • fit learns the mean and scale of each feature
    • transform standardizes the features using the mean and scale
  • HashingVectorizer
    • fit is not used, and thus it is known as a "stateless" transformer
    • transform creates the document-term matrix using a hash of the token
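As a concrete illustration of the fit/transform pattern, a minimal sketch (not from the original notebook) using StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_demo = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaler.fit(X_demo)                # learns the mean and scale of each column
scaler.transform(X_demo)          # standardizes using the learned statistics
scaler.transform([[2.0, 40.0]])   # new data is transformed with the *training* statistics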

Converting a function into a transformer

In [37]:
# define a function that accepts a DataFrame and returns the manually created features
def get_manual(df):
    return df.loc[:, ['num_ingredients', 'ingredient_length']]
In [38]:
get_manual(train).head()
Out[38]:
num_ingredients ingredient_length
0 9 12.000000
1 11 10.090909
2 12 10.333333
3 4 6.750000
4 20 10.100000
In [39]:
from sklearn.preprocessing import FunctionTransformer
In [40]:
# create a stateless transformer from the get_manual function
get_manual_ft = FunctionTransformer(get_manual, validate=False)
type(get_manual_ft)
Out[40]:
sklearn.preprocessing._function_transformer.FunctionTransformer
In [41]:
# execute the function using the transform method
get_manual_ft.transform(train).head()
Out[41]:
num_ingredients ingredient_length
0 9 12.000000
1 11 10.090909
2 12 10.333333
3 4 6.750000
4 20 10.100000
In [42]:
# define a function that accepts a DataFrame and returns the ingredients string
def get_text(df):
    return df.ingredients_str
In [43]:
# create and test another transformer
get_text_ft = FunctionTransformer(get_text, validate=False)
get_text_ft.transform(train).head()
Out[43]:
0    ['romaine lettuce', 'black olives', 'grape tom...
1    ['plain flour', 'ground pepper', 'salt', 'toma...
2    ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
3          ['water', 'vegetable oil', 'wheat', 'salt']
4    ['black pepper', 'shallots', 'cornflour', 'cay...
Name: ingredients_str, dtype: object

Combining feature extraction steps

  • FeatureUnion applies a list of transformers in parallel to the input data (not sequentially), then concatenates the results.
  • This is useful for combining several feature extraction mechanisms into a single transformer.

Pipeline versus FeatureUnion

In [44]:
from sklearn.pipeline import make_union
In [45]:
# create a document-term matrix from all of the training data
X_dtm = vect.fit_transform(X)
X_dtm.shape
Out[45]:
(39774, 6250)
In [46]:
# this is identical to a FeatureUnion with just one transformer
union = make_union(vect)
X_dtm = union.fit_transform(X)
X_dtm.shape
Out[46]:
(39774, 6250)
In [47]:
# try to add a second transformer to the FeatureUnion (what's wrong with this?)
# union = make_union(vect, get_manual_ft)
# X_dtm_manual = union.fit_transform(X)
# problem: a FeatureUnion passes the same input to every transformer, but vect expects
# the ingredient text (a Series of strings) while get_manual_ft expects the full DataFrame
In [48]:
# properly combine the transformers into a FeatureUnion
union = make_union(make_pipeline(get_text_ft, vect), get_manual_ft)
X_dtm_manual = union.fit_transform(train)
X_dtm_manual.shape
Out[48]:
(39774, 6252)

Pipeline in a FeatureUnion

Cross-validation

In [49]:
# slightly improper cross-validation: the vocabulary was learned from all rows before splitting
cross_val_score(nb, X_dtm_manual, y, cv=5, scoring='accuracy').mean()
Out[49]:
0.7102895106852953
In [50]:
# create a pipeline of the FeatureUnion and Naive Bayes
pipe = make_pipeline(union, nb)
In [51]:
# properly cross-validate the entire pipeline (and pass it the entire DataFrame)
cross_val_score(pipe, train, y, cv=5, scoring='accuracy').mean()
Out[51]:
0.7134318388611878

Alternative way to specify Pipeline and FeatureUnion

In [52]:
# reminder of how we created the pipeline
union = make_union(make_pipeline(get_text_ft, vect), get_manual_ft)
pipe = make_pipeline(union, nb)
In [53]:
# duplicate the pipeline structure without using make_pipeline or make_union
from sklearn.pipeline import Pipeline, FeatureUnion
pipe = Pipeline([
    ('featureunion', FeatureUnion([
            ('pipeline', Pipeline([
                    ('functiontransformer', get_text_ft),
                    ('countvectorizer', vect)
                    ])),
            ('functiontransformer', get_manual_ft)
        ])),
    ('multinomialnb', nb)
])

Grid search of a nested Pipeline

In [54]:
# examine the pipeline steps
pipe.steps
Out[54]:
[('featureunion', FeatureUnion(n_jobs=1,
         transformer_list=[('pipeline', Pipeline(memory=None,
       steps=[('functiontransformer', FunctionTransformer(accept_sparse=False,
            func=<function get_text at 0x118154e18>, inv_kw_args=None,
            inverse_func=None, kw_args=None, pass_y='deprecated',
            validate=False)), ('countve...gs=None,
            inverse_func=None, kw_args=None, pass_y='deprecated',
            validate=False))],
         transformer_weights=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]
In [55]:
# create a grid of parameters to search (and specify the pipeline step along with the parameter)
param_grid = {}
param_grid['featureunion__pipeline__countvectorizer__token_pattern'] = [r"\b\w\w+\b", r"'([a-z ]+)'"]
param_grid['multinomialnb__alpha'] = [0.5, 1]
param_grid
Out[55]:
{'featureunion__pipeline__countvectorizer__token_pattern': ['\\b\\w\\w+\\b',
  "'([a-z ]+)'"],
 'multinomialnb__alpha': [0.5, 1]}
In [56]:
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
In [57]:
%time grid.fit(train, y)
CPU times: user 45.1 s, sys: 1.63 s, total: 46.8 s
Wall time: 23.6 s
Out[57]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline', Pipeline(memory=None,
     steps=[('functiontransformer', FunctionTransformer(accept_sparse=False,
          func=<function get_text at 0x118154e18>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass...ormer_weights=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'featureunion__pipeline__countvectorizer__token_pattern': ['\\b\\w\\w+\\b', "'([a-z ]+)'"], 'multinomialnb__alpha': [0.5, 1]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)
In [58]:
print(grid.best_score_)
print(grid.best_params_)
0.7426710916679238
{'featureunion__pipeline__countvectorizer__token_pattern': "'([a-z ]+)'", 'multinomialnb__alpha': 0.5}

Part 7: Ensembling models

Rather than combining features into a single feature matrix and training a single model, we can instead create separate models and "ensemble" them.

What is ensembling?

Ensemble learning (or "ensembling") is the process of combining several predictive models in order to produce a combined model that is better than any individual model.

  • Regression: average the predictions made by the individual models
  • Classification: let the models "vote" and use the most common prediction, or average the predicted probabilities

For ensembling to work well, the models must have the following characteristics:

  • Accurate: they outperform the null model
  • Independent: their predictions are generated using different "processes", such as:
    • different types of models
    • different features
    • different tuning parameters

The big idea: If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.
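A rough back-of-the-envelope sketch of this idea (assuming perfectly independent errors, which real models never achieve): if five classifiers are each 70% accurate, a majority vote is correct whenever at least three of them are.

from scipy.stats import binom

# P(at least 3 of 5 independent 70%-accurate classifiers are correct) is about 0.84
sum(binom.pmf(k, 5, 0.7) for k in range(3, 6))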

Note: There are also models that have built-in ensembling, such as Random Forests.

Model 1: KNN model using only manually created features

In [59]:
# define X and y
feature_cols = ['num_ingredients', 'ingredient_length']
X = train[feature_cols]
y = train.cuisine
In [60]:
# use KNN with K=800
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=800)
In [61]:
# train KNN on all of the training data
knn.fit(X, y)
Out[61]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=800, p=2,
           weights='uniform')
In [62]:
# define X_new as the manually created features
X_new = new[feature_cols]
In [63]:
# calculate predicted probabilities of class membership for the new data
new_pred_prob_knn = knn.predict_proba(X_new)
new_pred_prob_knn.shape
Out[63]:
(9944, 20)
In [64]:
# print predicted probabilities for the first row only
new_pred_prob_knn[0, :]
Out[64]:
array([0.02625, 0.0275 , 0.01375, 0.04375, 0.03375, 0.08   , 0.0175 ,
       0.075  , 0.0275 , 0.135  , 0.01   , 0.075  , 0.01875, 0.165  ,
       0.00875, 0.0125 , 0.1525 , 0.025  , 0.0275 , 0.025  ])
In [65]:
# display classes with probabilities (zip returns a lazy iterator in Python 3,
# so wrap it in list() to actually see the pairs)
zip(knn.classes_, new_pred_prob_knn[0, :])
Out[65]:
<zip at 0x117ac1a08>
In [66]:
# predicted probabilities will sum to 1 for each row
new_pred_prob_knn[0, :].sum()
Out[66]:
1.0

Model 2: Naive Bayes model using only text features

In [67]:
# print the best model found by RandomizedSearchCV
rand.best_estimator_
Out[67]:
Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern="'([a-z ]+)'", tokenizer=None,
        vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=0.30233257263183977, class_prior=None, fit_prior=True))])
In [68]:
# define X_new as the ingredient text
X_new = new.ingredients_str
In [69]:
# calculate predicted probabilities of class membership for the new data
new_pred_prob_rand = rand.predict_proba(X_new)
new_pred_prob_rand.shape
Out[69]:
(9944, 20)
In [70]:
# print predicted probabilities for the first row only
new_pred_prob_rand[0, :]
Out[70]:
array([6.35624509e-04, 5.10677208e-01, 5.01039760e-05, 7.46758455e-05,
       3.64528916e-03, 1.36909784e-03, 4.25463842e-04, 3.16817133e-04,
       1.85847350e-01, 3.78331630e-03, 2.67495007e-04, 5.60369424e-04,
       4.27190054e-06, 8.85175984e-04, 8.50499605e-06, 3.04368393e-02,
       2.60701445e-01, 3.09630257e-04, 1.07646647e-06, 2.45297976e-07])

Ensembling models 1 and 2

In [71]:
# calculate the mean of the predicted probabilities for the first row
(new_pred_prob_knn[0, :] + new_pred_prob_rand[0, :]) / 2
Out[71]:
array([0.01344281, 0.2690886 , 0.00690005, 0.02191234, 0.01869764,
       0.04068455, 0.00896273, 0.03765841, 0.10667368, 0.06939166,
       0.00513375, 0.03778018, 0.00937714, 0.08294259, 0.00437925,
       0.02146842, 0.20660072, 0.01265482, 0.01375054, 0.01250012])
In [72]:
# calculate the mean of the predicted probabilities for all rows
new_pred_prob = pd.DataFrame((new_pred_prob_knn + new_pred_prob_rand) / 2, columns=knn.classes_)
new_pred_prob.head()
Out[72]:
brazilian british cajun_creole chinese filipino french greek indian irish italian jamaican japanese korean mexican moroccan russian southern_us spanish thai vietnamese
0 0.013443 0.269089 0.006900 0.021912 0.018698 0.040685 0.008963 0.037658 0.106674 0.069392 0.005134 0.037780 0.009377 0.082943 0.004379 0.021468 0.206601 0.012655 0.013751 0.012500
1 0.008752 0.011324 0.016875 0.045000 0.018132 0.023884 0.015625 0.046250 0.010629 0.070625 0.005626 0.027501 0.021875 0.066875 0.008125 0.008750 0.547901 0.007500 0.025625 0.013125
2 0.013158 0.009389 0.006951 0.020000 0.015010 0.041365 0.010101 0.029376 0.013372 0.408696 0.005628 0.038752 0.007500 0.080630 0.025887 0.008240 0.079377 0.158440 0.015625 0.012502
3 0.003125 0.004375 0.533750 0.038750 0.001875 0.023125 0.006250 0.075625 0.001250 0.051875 0.011875 0.008125 0.003125 0.107500 0.029375 0.001875 0.025000 0.007500 0.038125 0.027500
4 0.001878 0.009856 0.020097 0.021250 0.003125 0.044922 0.017501 0.013750 0.012547 0.640841 0.003752 0.007500 0.003750 0.083129 0.004376 0.003135 0.072838 0.018252 0.014375 0.003125
In [73]:
# for each row, find the column with the highest predicted probability
new_pred_class = new_pred_prob.apply(np.argmax, axis=1)
new_pred_class.head()
/Users/georgioskarakostas/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py:51: FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
  return getattr(obj, method)(*args, **kwds)
Out[73]:
0         british
1     southern_us
2         italian
3    cajun_creole
4         italian
dtype: object
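As the FutureWarning suggests, the same result can be obtained without the deprecated apply/argmax pattern; a one-line equivalent on the same new_pred_prob DataFrame:

# idxmax returns, for each row, the column label with the highest predicted probability
new_pred_class = new_pred_prob.idxmax(axis=1)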
In [74]:
# create a submission file (score: 0.75241)
pd.DataFrame({'id':new.id, 'cuisine':new_pred_class}).set_index('id').to_csv('sub4.csv')

Note: VotingClassifier (new in 0.17) makes it easier to ensemble classifiers, though it is limited to the case in which all of the classifiers are fit to the same data.
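A minimal sketch of what that looks like (the shared feature matrices X_shared and X_new_shared below are hypothetical; our KNN and Naive Bayes models above use different features, so they cannot be combined this way without rework):

from sklearn.ensemble import VotingClassifier

# voting='soft' averages predict_proba across the estimators, like the manual ensemble above
voter = VotingClassifier(estimators=[('knn', knn), ('nb', nb)], voting='soft')
# voter.fit(X_shared, y)
# voter.predict_proba(X_new_shared)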

Part 8: Locating groups of similar cuisines

In [75]:
# for each cuisine, combine all of the recipes into a single string
cuisine_ingredients = train.groupby('cuisine').ingredients_str.sum()
cuisine_ingredients
Out[75]:
cuisine
brazilian       ['ice cubes', 'club soda', 'white rum', 'lime'...
british         ['greek yogurt', 'lemon curd', 'confectioners ...
cajun_creole    ['herbs', 'lemon juice', 'fresh tomatoes', 'pa...
chinese         ['low sodium soy sauce', 'fresh ginger', 'dry ...
filipino        ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
french          ['sugar', 'salt', 'fennel bulb', 'water', 'lem...
greek           ['romaine lettuce', 'black olives', 'grape tom...
indian          ['water', 'vegetable oil', 'wheat', 'salt']['b...
irish           ['cooking spray', 'salt', 'black pepper', 'yuk...
italian         ['sugar', 'pistachio nuts', 'white almond bark...
jamaican        ['plain flour', 'sugar', 'butter', 'eggs', 'fr...
japanese        ['sirloin', 'mirin', 'yellow onion', 'low sodi...
korean          ['jasmine rice', 'garlic', 'scallions', 'sugar...
mexican         ['olive oil', 'purple onion', 'fresh pineapple...
moroccan        ['ground cloves', 'whole nutmegs', 'ground gin...
russian         ['water', 'grits', 'mozzarella cheese', 'salt'...
southern_us     ['plain flour', 'ground pepper', 'salt', 'toma...
spanish         ['olive oil', 'salt', 'medium shrimp', 'pepper...
thai            ['sugar', 'hot chili', 'asian fish sauce', 'li...
vietnamese      ['soy sauce', 'vegetable oil', 'red bell peppe...
Name: ingredients_str, dtype: object
In [76]:
# examine the brazilian ingredients
cuisine_ingredients['brazilian'][0:500]
Out[76]:
"['ice cubes', 'club soda', 'white rum', 'lime', 'turbinado']['eggs', 'hearts of palm', 'cilantro', 'coconut cream', 'flax seed meal', 'kosher salt', 'jalapeno chilies', 'garlic', 'cream cheese, soften', 'coconut oil', 'lime juice', 'crushed red pepper flakes', 'ground coriander', 'pepper', 'chicken breasts', 'coconut flour', 'onions']['sweetened condensed milk', 'butter', 'cocoa powder']['lime', 'crushed ice', 'simple syrup', 'cachaca']['sugar', 'corn starch', 'egg whites', 'boiling water', 'col"
In [77]:
# confirm that they match the brazilian recipes
train.loc[train.cuisine=='brazilian', 'ingredients_str'].head()
Out[77]:
41     ['ice cubes', 'club soda', 'white rum', 'lime'...
380    ['eggs', 'hearts of palm', 'cilantro', 'coconu...
423    ['sweetened condensed milk', 'butter', 'cocoa ...
509    ['lime', 'crushed ice', 'simple syrup', 'cacha...
724    ['sugar', 'corn starch', 'egg whites', 'boilin...
Name: ingredients_str, dtype: object
In [78]:
# create a document-term matrix from cuisine_ingredients
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
cuisine_dtm = vect.fit_transform(cuisine_ingredients)
cuisine_dtm.shape
Out[78]:
(20, 3010)
In [79]:
# calculate the cosine similarity between each cuisine and all other cuisines
from sklearn import metrics
cuisine_similarity = []
for idx in range(cuisine_dtm.shape[0]):
    similarity = metrics.pairwise.linear_kernel(cuisine_dtm[idx, :], cuisine_dtm).flatten()
    cuisine_similarity.append(similarity)
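An equivalent shortcut (a small aside, not in the original run): because TfidfVectorizer L2-normalizes each row by default, the linear kernel of the matrix with itself is already the full cosine-similarity matrix, and scikit-learn can compute it in one call.

# same 20x20 matrix of pairwise cosine similarities, computed in a single call
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(cuisine_dtm)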
In [80]:
# convert the results to a DataFrame
cuisine_list = cuisine_ingredients.index
cuisine_similarity = pd.DataFrame(cuisine_similarity, index=cuisine_list, columns=cuisine_list)
cuisine_similarity
Out[80]:
cuisine brazilian british cajun_creole chinese filipino french greek indian irish italian jamaican japanese korean mexican moroccan russian southern_us spanish thai vietnamese
cuisine
brazilian 1.000000 0.660232 0.742324 0.580756 0.769216 0.756392 0.695692 0.687271 0.665713 0.740527 0.778320 0.555601 0.571440 0.743736 0.669009 0.706087 0.743156 0.807694 0.685539 0.653801
british 0.660232 1.000000 0.591230 0.467640 0.631356 0.859609 0.562750 0.560349 0.926682 0.632618 0.662057 0.508296 0.447177 0.560446 0.543260 0.909551 0.911271 0.604000 0.445518 0.478901
cajun_creole 0.742324 0.591230 1.000000 0.605581 0.746151 0.708849 0.688391 0.618955 0.635197 0.738159 0.780897 0.532394 0.578645 0.724877 0.649831 0.657802 0.747480 0.803637 0.590103 0.605224
chinese 0.580756 0.467640 0.605581 1.000000 0.839803 0.540446 0.496090 0.553532 0.460746 0.555504 0.635953 0.835587 0.866828 0.561837 0.505655 0.521844 0.558514 0.603526 0.755813 0.817005
filipino 0.769216 0.631356 0.746151 0.839803 1.000000 0.682939 0.607436 0.655934 0.641010 0.670628 0.792723 0.748558 0.782623 0.678302 0.614984 0.697000 0.720368 0.727409 0.741512 0.806833
french 0.756392 0.859609 0.708849 0.540446 0.682939 1.000000 0.759936 0.624868 0.837384 0.835272 0.723225 0.540279 0.502205 0.666830 0.685384 0.881173 0.862062 0.817541 0.548375 0.570925
greek 0.695692 0.562750 0.688391 0.496090 0.607436 0.759936 1.000000 0.640297 0.583675 0.859270 0.681281 0.469465 0.479835 0.696644 0.769412 0.649530 0.641229 0.837448 0.519004 0.538683
indian 0.687271 0.560349 0.618955 0.553532 0.655934 0.624868 0.640297 1.000000 0.577338 0.616211 0.734926 0.567993 0.538853 0.708621 0.795271 0.607432 0.617278 0.678865 0.627460 0.605162
irish 0.665713 0.926682 0.635197 0.460746 0.641010 0.837384 0.583675 0.577338 1.000000 0.649878 0.680914 0.494762 0.458798 0.591718 0.563303 0.892428 0.902850 0.630921 0.449931 0.481712
italian 0.740527 0.632618 0.738159 0.555504 0.670628 0.835272 0.859270 0.616211 0.649878 1.000000 0.695768 0.510280 0.522568 0.733959 0.709827 0.697593 0.718945 0.858166 0.555088 0.571096
jamaican 0.778320 0.662057 0.780897 0.635953 0.792723 0.723225 0.681281 0.734926 0.680914 0.695768 1.000000 0.584689 0.609203 0.731859 0.757462 0.684492 0.752862 0.751672 0.650898 0.664175
japanese 0.555601 0.508296 0.532394 0.835587 0.748558 0.540279 0.469465 0.567993 0.494762 0.510280 0.584689 1.000000 0.819828 0.506539 0.477401 0.557275 0.554022 0.547336 0.682604 0.738413
korean 0.571440 0.447177 0.578645 0.866828 0.782623 0.502205 0.479835 0.538853 0.458798 0.522568 0.609203 0.819828 1.000000 0.516461 0.477964 0.517680 0.516811 0.582969 0.671054 0.747119
mexican 0.743736 0.560446 0.724877 0.561837 0.678302 0.666830 0.696644 0.708621 0.591718 0.733959 0.731859 0.506539 0.516461 1.000000 0.697442 0.630541 0.691398 0.739874 0.617627 0.623531
moroccan 0.669009 0.543260 0.649831 0.505655 0.614984 0.685384 0.769412 0.795271 0.563303 0.709827 0.757462 0.477401 0.477964 0.697442 1.000000 0.608343 0.605958 0.784612 0.533128 0.553375
russian 0.706087 0.909551 0.657802 0.521844 0.697000 0.881173 0.649530 0.607432 0.892428 0.697593 0.684492 0.557275 0.517680 0.630541 0.608343 1.000000 0.877901 0.702752 0.494331 0.539100
southern_us 0.743156 0.911271 0.747480 0.558514 0.720368 0.862062 0.641229 0.617278 0.902850 0.718945 0.752862 0.554022 0.516811 0.691398 0.605958 0.877901 1.000000 0.707774 0.536965 0.562359
spanish 0.807694 0.604000 0.803637 0.603526 0.727409 0.817541 0.837448 0.678865 0.630921 0.858166 0.751672 0.547336 0.582969 0.739874 0.784612 0.702752 0.707774 1.000000 0.606200 0.614200
thai 0.685539 0.445518 0.590103 0.755813 0.741512 0.548375 0.519004 0.627460 0.449931 0.555088 0.650898 0.682604 0.671054 0.617627 0.533128 0.494331 0.536965 0.606200 1.000000 0.914986
vietnamese 0.653801 0.478901 0.605224 0.817005 0.806833 0.570925 0.538683 0.605162 0.481712 0.571096 0.664175 0.738413 0.747119 0.623531 0.553375 0.539100 0.562359 0.614200 0.914986 1.000000
In [81]:
# display the similarities as a heatmap
%matplotlib inline
import seaborn as sns
sns.heatmap(cuisine_similarity)
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x1218d28d0>
In [82]:
# hand-selected cuisine groups
group_1 = ['chinese', 'filipino', 'japanese', 'korean', 'thai', 'vietnamese']
group_2 = ['british', 'french', 'irish', 'russian', 'southern_us']
group_3 = ['greek', 'italian', 'moroccan', 'spanish']
group_4 = ['brazilian', 'cajun_creole', 'indian', 'jamaican', 'mexican']

Part 9: Model stacking

  • The term "model stacking" is used any time there are multiple "levels" of models, in which the outputs from one level are used as inputs to another level.
  • In this case, we will create one model that predicts the cuisine group for a recipe. Within each of the four groups, we will create another model that predicts the actual cuisine.
  • Our theory is that each of these five models may need to be tuned differently for maximum accuracy, but will ultimately result in a process that is more accurate than a single-level model.
In [83]:
# create a dictionary that maps each cuisine to its group number
cuisines = group_1 + group_2 + group_3 + group_4
group_numbers = [1]*len(group_1) + [2]*len(group_2) + [3]*len(group_3) + [4]*len(group_4)
cuisine_to_group = dict(zip(cuisines, group_numbers))
cuisine_to_group
Out[83]:
{'chinese': 1,
 'filipino': 1,
 'japanese': 1,
 'korean': 1,
 'thai': 1,
 'vietnamese': 1,
 'british': 2,
 'french': 2,
 'irish': 2,
 'russian': 2,
 'southern_us': 2,
 'greek': 3,
 'italian': 3,
 'moroccan': 3,
 'spanish': 3,
 'brazilian': 4,
 'cajun_creole': 4,
 'indian': 4,
 'jamaican': 4,
 'mexican': 4}
In [84]:
# map the cuisines to their group numbers
train['group'] = train.cuisine.map(cuisine_to_group)
train.head()
Out[84]:
cuisine id ingredients num_ingredients ingredient_length ingredients_str group
0 greek 10259 [romaine lettuce, black olives, grape tomatoes... 9 12.000000 ['romaine lettuce', 'black olives', 'grape tom... 3
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g... 11 10.090909 ['plain flour', 'ground pepper', 'salt', 'toma... 2
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g... 12 10.333333 ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki... 1
3 indian 22213 [water, vegetable oil, wheat, salt] 4 6.750000 ['water', 'vegetable oil', 'wheat', 'salt'] 4
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe... 20 10.100000 ['black pepper', 'shallots', 'cornflour', 'cay... 4
In [85]:
# confirm that all recipes were assigned a cuisine group
train.group.isnull().sum()
Out[85]:
0
In [86]:
# calculate the cross-validated accuracy of using text to predict cuisine group
X = train.ingredients_str
y = train.group
pipe_main = make_pipeline(CountVectorizer(), MultinomialNB())
cross_val_score(pipe_main, X, y, cv=5, scoring='accuracy').mean()
Out[86]:
0.8276513701245822
In [87]:
# define an X and y for each cuisine group
X1 = train.loc[train.group==1, 'ingredients_str']
y1 = train.loc[train.group==1, 'cuisine']
X2 = train.loc[train.group==2, 'ingredients_str']
y2 = train.loc[train.group==2, 'cuisine']
X3 = train.loc[train.group==3, 'ingredients_str']
y3 = train.loc[train.group==3, 'cuisine']
X4 = train.loc[train.group==4, 'ingredients_str']
y4 = train.loc[train.group==4, 'cuisine']
In [88]:
# define a pipeline for each cuisine group
pipe_1 = make_pipeline(CountVectorizer(), MultinomialNB())
pipe_2 = make_pipeline(CountVectorizer(), MultinomialNB())
pipe_3 = make_pipeline(CountVectorizer(), MultinomialNB())
pipe_4 = make_pipeline(CountVectorizer(), MultinomialNB())
In [89]:
# within each cuisine group, calculate the cross-validated accuracy of using text to predict cuisine
print(cross_val_score(pipe_1, X1, y1, cv=5, scoring='accuracy').mean())
print(cross_val_score(pipe_2, X2, y2, cv=5, scoring='accuracy').mean())
print(cross_val_score(pipe_3, X3, y3, cv=5, scoring='accuracy').mean())
print(cross_val_score(pipe_4, X4, y4, cv=5, scoring='accuracy').mean())
0.7693031228079263
0.7568885301219405
0.8701840957938736
0.9043403347972706

Note: Ideally, each of the five pipelines should be individually tuned from start to finish, including feature engineering, model selection, and parameter tuning.

Making predictions for the new data

In [90]:
# fit each pipeline with the relevant X and y
pipe_main.fit(X, y)
pipe_1.fit(X1, y1)
pipe_2.fit(X2, y2)
pipe_3.fit(X3, y3)
pipe_4.fit(X4, y4)
Out[90]:
Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
In [91]:
# for the new data, first make cuisine group predictions
X_new = new.ingredients_str
new_pred_group = pipe_main.predict(X_new)
new_pred_group
Out[91]:
array([2, 2, 3, ..., 3, 4, 4])
In [92]:
# then within each predicted cuisine group, make cuisine predictions
new_pred_class_1 = pipe_1.predict(X_new[new_pred_group==1])
new_pred_class_2 = pipe_2.predict(X_new[new_pred_group==2])
new_pred_class_3 = pipe_3.predict(X_new[new_pred_group==3])
new_pred_class_4 = pipe_4.predict(X_new[new_pred_group==4])
print(new_pred_class_1)
print(new_pred_class_2)
print(new_pred_class_3)
print(new_pred_class_4)
['chinese' 'japanese' 'vietnamese' ... 'chinese' 'chinese' 'vietnamese']
['british' 'southern_us' 'southern_us' ... 'southern_us' 'french' 'french']
['spanish' 'italian' 'spanish' ... 'italian' 'italian' 'italian']
['cajun_creole' 'mexican' 'indian' ... 'mexican' 'cajun_creole' 'mexican']
In [93]:
# add the cuisine predictions to the DataFrame of new data
new.loc[new_pred_group==1, 'pred_class'] = new_pred_class_1
new.loc[new_pred_group==2, 'pred_class'] = new_pred_class_2
new.loc[new_pred_group==3, 'pred_class'] = new_pred_class_3
new.loc[new_pred_group==4, 'pred_class'] = new_pred_class_4
In [94]:
new.head()
Out[94]:
id ingredients num_ingredients ingredient_length ingredients_str pred_class
0 18009 [baking powder, eggs, all-purpose flour, raisi... 6 9.333333 ['baking powder', 'eggs', 'all-purpose flour',... british
1 28583 [sugar, egg yolks, corn starch, cream of tarta... 11 10.272727 ['sugar', 'egg yolks', 'corn starch', 'cream o... southern_us
2 41580 [sausage links, fennel bulb, fronds, olive oil... 6 9.666667 ['sausage links', 'fennel bulb', 'fronds', 'ol... spanish
3 29752 [meat cuts, file powder, smoked sausage, okra,... 21 12.000000 ['meat cuts', 'file powder', 'smoked sausage',... cajun_creole
4 35687 [ground black pepper, salt, sausage casings, l... 8 13.000000 ['ground black pepper', 'salt', 'sausage casin... italian
In [95]:
# create a submission file (score: 0.70475)
pd.DataFrame({'id':new.id, 'cuisine':new.pred_class}).set_index('id').to_csv('sub5.csv')