
Project Title: Geographic Classification of Cuisine Recipes

Jinhyeun Kim, Youngjo Kim, Jianyuan Zhai

1. Introduction

1-1. The goal of the project

Classify the geographic origins of recipes based on the ingredients used

Proposal Link

1-2. Why is it important?

Cuisines differ across countries and are shaped primarily by regional conditions such as local climate, religion, and trade. Classifying cuisines can therefore improve our understanding of each culture and lifestyle.

In [1]:

from IPython.display import Image, HTML, IFrame
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Foodworld.png?raw=true')

Out[1]:

2. Method

2-1. How did we approach the problem?

1. Get the data from Kaggle (use pandas to read the JSON file into a pandas DataFrame)

2. Data Preprocessing: Construct a binary matrix X (39774 x 6714)

$\;\;\;\;\;\;$2-1) 39774 recipes from 20 countries, with 6714 distinct ingredients in total
$\;\;\;\;\;\;$2-2) For recipe $i$, $X_{i,j}$ equals 1 if ingredient $j$ is used and 0 otherwise

3. Train/Test Split (0.8, 0.2) of the binary matrix X and the label y

4. Model Training and Selection: Use scikit-learn to train the classifiers

$\;\;\;\;\;\;$ 1. Perceptron
$\;\;\;\;\;\;$ 2. Logistic Regression
$\;\;\;\;\;\;$ 3. Linear SVM
$\;\;\;\;\;\;$ 4. ANN

5. Evaluate the models using ROC curves
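Steps 1 to 4 above can be sketched roughly as follows. This is only an illustrative sketch, not the project's actual code: the toy recipes, the column names `cuisine` and `ingredients`, the file name in the comment, the stratified split, and all hyperparameters are assumptions made for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy recipes standing in for the Kaggle JSON file.
# With the real data, step 1 would be something like pd.read_json("train.json").
base = {
    "italian": ["pasta", "tomato", "basil", "olive oil", "parmesan"],
    "korean": ["rice", "kimchi", "sesame oil", "gochujang", "garlic"],
    "mexican": ["tortilla", "beans", "avocado", "lime", "cilantro"],
}
# Each toy recipe drops one ingredient from its cuisine's base list.
records = [
    {"cuisine": cuisine, "ingredients": [x for k, x in enumerate(items) if k != drop]}
    for cuisine, items in base.items()
    for drop in range(len(items))
]
df = pd.DataFrame(records)

# Step 2: X[i, j] = 1 if recipe i uses ingredient j, and 0 otherwise.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(df["ingredients"])
y = df["cuisine"].to_numpy()

# Step 3: 80/20 train/test split (stratification is an assumption of this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 4: train the four classifier families listed above.
models = {
    "Perceptron": Perceptron(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}
test_accuracy = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
print(test_accuracy)
```

On the real data, `X` would have shape (39774, 6714), so a sparse representation (for example `MultiLabelBinarizer(sparse_output=True)`) may be worth considering.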

In [2]:

IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/GeoChart(Country).html', width=970, height=650)

Out[2]:

In [3]:

img = 'https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/MainWorkflow.png?raw=true'
Image(url=img)

Out[3]:

2-2. Challenges encountered

1. Class imbalance in the sample affects statistical estimation and can bias the classifier
2. The matrix is very large and slow to process (training the ANN on the entire dataset took over 2 hours)

How can we reduce the sample imbalance problem?

2-3. New in our approach

Take a Two-step Approach:

What if we classify first by continent and then by country?
We assume that the two-step approach will not introduce classification bias, since food varies by region (food and region are highly correlated)

In [4]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Dendogram(country).PNG?raw=true')

Out[4]:

In [5]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Dendogram(continent).PNG?raw=true')

Out[5]:

In [6]:

IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/GeoChart(Continent).html', width=970, height=650)

Out[6]:

Through the two-step approach, we can reduce the sample imbalance problem

In [7]:

IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/PieChart(Country).html', width=800, height=400)

Out[7]:

In [8]:

IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/PieChart(Continent).html', width=800, height=400)

Out[8]:

There is a high imbalance for Africa since the dataset for Africa contains only 821 recipes (all from Morocco)

We tried setting class_weight to 'balanced' (putting less weight on the majority class instances)
$\;\;$→ This results in slightly (~0.1%) higher training accuracy but lower testing accuracy than the normal logistic regression
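As a rough illustration of what class weighting does, the sketch below computes scikit-learn's "balanced" weights, which follow `n_samples / (n_classes * count_per_class)`, and fits a weighted and an unweighted logistic regression. The label counts and the random binary features are hypothetical and are not taken from the project data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 majority-class and 10 minority-class recipes.
y = np.array(["italian"] * 90 + ["moroccan"] * 10)
classes = np.unique(y)

# "balanced" weights: n_samples / (n_classes * count of each class).
weights = compute_class_weight("balanced", classes=classes, y=y)
class_weights = {str(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)

# Random binary features, used only so that both models can be fitted.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(len(y), 5))

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
```

With these counts the minority class receives a weight of 5.0 and the majority class about 0.56, so errors on minority recipes cost roughly nine times more during training.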

In [9]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Logistic(Weight_balanced_compare).png?raw=true')

Out[9]:

Two-Step Approach Workflow

In [10]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Workflow3.png?raw=true')

Out[10]:

The goal is to correctly classify each recipe by its geographic origin

For example, our model should classify spaghetti as Italian

In [11]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Workflow4.png?raw=true')

Out[11]:

We select one best classifier at the continent level and three best classifiers at the country level, one for each continent (America, Asia, and Europe)

$\;\;\;\;\;\;$1. Use the ROC curves to determine the best model at each level
$\;\;\;\;\;\;$2. Combine the best model at the continent level with the best models at the country level
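One way to turn the ROC comparison into a selection rule is to compute a macro-averaged one-vs-rest AUC for each candidate and keep the highest. The sketch below assumes this criterion and uses synthetic data from `make_classification` in place of the recipe matrix; the candidate list and all parameters are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for one level (e.g. the continent level).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(random_state=0),
}

# Binarize the labels one-vs-rest so each class gets its own ROC curve.
Y_test = label_binarize(y_test, classes=np.unique(y))

macro_auc = {}
for name, model in candidates.items():
    # decision_function returns one score column per class for both models.
    scores = model.fit(X_train, y_train).decision_function(X_test)
    macro_auc[name] = roc_auc_score(Y_test, scores, average="macro")

best_name = max(macro_auc, key=macro_auc.get)
print(macro_auc, "->", best_name)
```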

2-4. Evaluation of the Model

ROC Curves

In [12]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ROCCurvesContinent.png?raw=true')

Out[12]:

In [13]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ROCCurvesCountries.png?raw=true')

Out[13]:

We picked the four best models,

$\;\;\;\;\;\;$For the continent level, logistic regression
$\;\;\;\;\;\;$For the country level (America), logistic regression
$\;\;\;\;\;\;$For the country level (Asia), logistic regression
$\;\;\;\;\;\;$For the country level (Europe), SVM

and connected the best models in sequence
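Connecting the models in sequence can be sketched as a small wrapper: a continent classifier routes each recipe to the country classifier trained for that continent. The class name `TwoStepClassifier`, the `country_to_continent` mapping, and the toy data are hypothetical. For brevity the sketch uses logistic regression at every level, whereas the selection above uses an SVM for Europe; the factory could be replaced with a per-continent choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class TwoStepClassifier:
    """Hypothetical two-step classifier: predict continent, then country."""

    def __init__(self, country_to_continent,
                 make_model=lambda: LogisticRegression(max_iter=1000)):
        self.country_to_continent = country_to_continent
        self.make_model = make_model

    def fit(self, X, y_country):
        X, y_country = np.asarray(X), np.asarray(y_country)
        y_continent = np.array([self.country_to_continent[c] for c in y_country])
        self.continent_model_ = self.make_model().fit(X, y_continent)
        self.country_models_ = {}
        for continent in np.unique(y_continent):
            mask = y_continent == continent
            countries = np.unique(y_country[mask])
            if len(countries) == 1:
                # Only one country on this continent: no second model is needed.
                self.country_models_[str(continent)] = str(countries[0])
            else:
                model = self.make_model().fit(X[mask], y_country[mask])
                self.country_models_[str(continent)] = model
        return self

    def predict(self, X):
        X = np.asarray(X)
        continents = self.continent_model_.predict(X)
        out = np.empty(len(X), dtype=object)
        for continent in np.unique(continents):
            mask = continents == continent
            model = self.country_models_[str(continent)]
            out[mask] = model if isinstance(model, str) else model.predict(X[mask])
        return out


# Toy binary matrix: each country has two signature ingredient columns.
mapping = {"italian": "Europe", "french": "Europe", "korean": "Asia",
           "chinese": "Asia", "moroccan": "Africa"}
countries = list(mapping)
X = np.zeros((30, 10), dtype=int)
for k in range(len(countries)):
    X[6 * k:6 * (k + 1), 2 * k:2 * k + 2] = 1
y = np.repeat(countries, 6)

clf = TwoStepClassifier(mapping).fit(X, y)
pred = clf.predict(X)
```

One consequence of this design is that an error at the continent step cannot be corrected at the country step, so the continent classifier bounds the overall accuracy.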

In [14]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Accuracycompare1.png?raw=true')

Out[14]:

In [15]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Accuracycompare2.png?raw=true')

Out[15]:

Overall Classifier (Best Classifier 1 + Best Classifier 2)

In [16]:

Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ComparisonBetweenSVMand2Step_resize.png?raw=true')

Out[16]:

Performance of the two-step approach versus the normal approach

Overall accuracy for the testing set

Testing accuracy with SVM: 0.791
Testing accuracy with the two-step approach: 0.847

(1) The two-step approach has better overall test accuracy
(2) Based on the confusion matrix plot, however, we cannot conclude that the two-step approach is better at predicting every cuisine

Interpretation of the Confusion Matrix

Discovering the relationships between cuisines across regions and cultures

In [17]:

IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/ConfusionMatrix.html', width=900, height=900)

Out[17]:

1. America section

9.3% of Cajun & Creole foods are predicted as Southern US foods
More than 8% each of Cajun & Creole foods and Southern US foods are predicted as French or Italian foods, possibly because people have long migrated from Europe to America
Cajun & Creole cuisine comes from Louisiana, a former French colony, which may explain why 4.44% of Cajun & Creole foods are predicted as French foods

2. Asia section

Foods from many countries are confused with Chinese foods. This is plausible because China has influenced most of these countries for several thousand years
Vietnamese and Thai foods may have influenced each other because the two countries are geographically close
Misclassified Filipino foods are more often predicted as Western foods, possibly because the Philippines was colonized by Spain and America
One result that does not follow our assumption is that we cannot find any relationship between Vietnamese and French foods, even though Vietnam was colonized by France for several decades.

3. Europe section

Many foods from European countries are predicted as Italian or French foods, possibly because European countries have influenced each other for thousands of years
Many of the wrong predictions as Italian or French foods may also be caused by sample size (Italian has the most samples and French has the second most)
3.6% of British foods were predicted as Indian foods, which may be explained by the many Indian people who have moved to England

4. Africa section

Our African data set contains only one country, Morocco.
Some Moroccan foods were predicted as European foods, possibly because Morocco is very close to Europe.
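Percentages like those quoted in the four sections above correspond to a row-normalized confusion matrix, where each row (true cuisine) sums to 100%. A minimal sketch with made-up labels and predictions, not the project's results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["cajun_creole", "french", "southern_us"]

# Hypothetical true and predicted cuisines, 10 recipes per true class.
y_true = ["cajun_creole"] * 10 + ["french"] * 10 + ["southern_us"] * 10
y_pred = (
    ["cajun_creole"] * 8 + ["french"] + ["southern_us"]  # true cajun_creole
    + ["french"] * 10                                     # true french
    + ["southern_us"] * 9 + ["cajun_creole"]              # true southern_us
)

# Rows are true cuisines, columns are predicted cuisines (in `labels` order).
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Divide each row by its total to get "percent of true class predicted as ...".
row_pct = 100 * cm / cm.sum(axis=1, keepdims=True)
print(np.round(row_pct, 1))
```

In this toy example, `row_pct[0, 2]` reads as "10% of Cajun & Creole recipes are predicted as Southern US".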

3. Discussion

1. Does the best model at the continent level + the best model at the country level lead to the best model overall?

No: Best + Second Best has a slightly higher testing accuracy
Possible reason: Logistic regression and SVM differ very little in their AUC values (less than 0.07%) in the ROC plot, and we used macro-averaging for multi-class classification
$\;\;\;\;\;\;$ → The different numbers of binarized samples should be weighted to correctly account for the imbalance of the sample numbers
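To make the averaging point concrete, the sketch below compares macro-averaging, which treats every class equally, with support-weighted averaging of the per-class one-vs-rest AUC values. The class sizes and the noisy scores are synthetic assumptions chosen only to produce an imbalanced example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical imbalanced 3-class problem: 200, 60, and 20 samples.
y = np.repeat([0, 1, 2], [200, 60, 20])
Y = np.eye(3, dtype=int)[y]  # one-vs-rest binarized labels

# Synthetic scores: true label plus noise, noisier for the rarer classes.
scores = Y + rng.normal(scale=[0.5, 1.0, 2.0], size=Y.shape)

# One AUC per class, then two ways of averaging them.
per_class_auc = roc_auc_score(Y, scores, average=None)
macro = per_class_auc.mean()                         # every class counts equally
support = Y.sum(axis=0)                              # number of samples per class
weighted = np.average(per_class_auc, weights=support)

print(per_class_auc, macro, weighted)
```

When the rare classes are also the hardest ones, the two averages can rank closely matched models differently, which is one plausible reading of the observation above.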

2. How about applying PCA to the entire large dataset?

We did a Multiple Correspondence Analysis (MCA) on the entire dataset
1. Half of the principal components are needed to explain 99% of the variance of the binary data (X)
2. Using these reduced features, we ran a logistic regression on the entire dataset
$\;\;\;\;$ → The test accuracy was 0.73, which is lower than that of the two-step approach
$\;\;\;\;$ → For the large binary recipe dataset, feature extraction does not reduce the data space efficiently
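The variance-retention check can be illustrated with scikit-learn's PCA, which is used here as a stand-in because it is readily available; it is not the MCA the project ran, and the random sparse binary matrix below is a hypothetical substitute for the recipe data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical sparse binary matrix: 300 "recipes" x 40 "ingredients".
X = (rng.random((300, 40)) < 0.1).astype(float)

# A float n_components asks PCA to keep just enough components to explain
# at least that fraction of the variance (requires the full SVD solver).
pca = PCA(n_components=0.99, svd_solver="full")
X_reduced = pca.fit_transform(X)

n_kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print(f"{n_kept} of {X.shape[1]} components explain {explained:.3f} of the variance")
```

When the binary columns are close to independent, as in this toy matrix, most components have to be kept, which is consistent with the limited benefit reported above.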

4. Conclusion

1. We achieved high accuracy in classifying the cuisines with the two-step classifiers

2. We believe this project can benefit people who love food and are willing to learn about the culture behind different recipes

3. We hope our model can serve as a useful resource for people who are interested in discovering the relationships between cuisines and people's lifestyles across regions and cultures

References

Data Source:

https://www.kaggle.com/kaggle/recipe-ingredients-dataset#test.json

Publication:

Kotsiantis SB. Supervised Machine Learning: A Review of Classification Techniques. Proceedings of the 2007 conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies; 2007.
Vapnik V, Kotz S. Estimation of Dependences Based on Empirical Data. Springer; 2006.
Vapnik V. The Nature of Statistical Learning Theory. Springer New York; 1999.
Jonathon Shlens. A Tutorial on Principal Component Analysis. arXiv:1404.1100v1; 2014.
Herve Abdi & Dominique Valentin. Multiple Correspondence Analysis; 2007.
Howard Bergman et al. Correspondence analysis is a useful tool to uncover the relationships among categorical variables; 2010.

Others:
https://www.researchgate.net/post/Machine_learning_if_proportion_of_number_of_cases_in_different_class_in_training_set_matters
https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/chawla2002.html
http://www.bigendiandata.com/2017-06-27-Mapping_in_Jupyter/
https://towardsdatascience.com/a-complete-guide-to-an-interactive-geographical-map-using-python-f4c5197e23e0
https://bokeh.pydata.org/en/latest/docs/gallery/unemployment.html
http://www.freeworldmaps.net

Supplement:
https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/MainCode-RecipeClassification_CX4240_Summer2019.ipynb