Classify the geographic origins of recipes based on the ingredients used
Cuisines differ across countries and are shaped by local conditions such as climate, religion, and trade. Classifying cuisines can therefore improve our understanding of each culture and lifestyle.
from IPython.display import Image, HTML, IFrame
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Foodworld.png?raw=true')
1. Get the data from Kaggle (use 'pandas' to read the JSON file into a pandas DataFrame)
2. Data Preprocessing: Construct a binary matrix X (39774 x 6714)
$\;\;\;\;\;\;$2-1) 39774 recipes from 20 countries, using 6714 distinct ingredients in total
$\;\;\;\;\;\;$2-2) For recipe $i$, $X_{i,j} = 1$ if ingredient $j$ is used and $X_{i,j} = 0$ otherwise
3. Train/Test Split (0.8, 0.2) of the binary matrix X and the label y
4. Model Training and Selection: Use 'scikit-learn' to train the classifiers
$\;\;\;\;\;\;$ 1. Perceptron
$\;\;\;\;\;\;$ 2. Logistic Regression
$\;\;\;\;\;\;$ 3. Linear SVM
$\;\;\;\;\;\;$ 4. ANN
5. Evaluate the model based on ROC Curves
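The first four steps above can be sketched on a toy dataset. This is a minimal illustration, not the project's code: the ten recipes are made up, the field names (`id`, `cuisine`, `ingredients`) are assumed to match the Kaggle file, and the hyperparameters are placeholders.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-in for the Kaggle file (assumed fields: id, cuisine, ingredients).
toy_json = """[
  {"id": 0, "cuisine": "italian", "ingredients": ["pasta", "tomato", "basil"]},
  {"id": 1, "cuisine": "italian", "ingredients": ["pasta", "olive oil", "garlic"]},
  {"id": 2, "cuisine": "italian", "ingredients": ["tomato", "olive oil", "basil"]},
  {"id": 3, "cuisine": "italian", "ingredients": ["pasta", "tomato", "garlic"]},
  {"id": 4, "cuisine": "italian", "ingredients": ["pasta", "basil", "olive oil"]},
  {"id": 5, "cuisine": "thai", "ingredients": ["rice", "fish sauce", "lime"]},
  {"id": 6, "cuisine": "thai", "ingredients": ["rice", "coconut milk", "chili"]},
  {"id": 7, "cuisine": "thai", "ingredients": ["fish sauce", "lime", "chili"]},
  {"id": 8, "cuisine": "thai", "ingredients": ["rice", "fish sauce", "garlic"]},
  {"id": 9, "cuisine": "thai", "ingredients": ["coconut milk", "lime", "chili"]}
]"""

# Step 1: read the JSON records into a DataFrame.
recipes = pd.read_json(io.StringIO(toy_json))

# Step 2: binary matrix X, where X[i, j] = 1 if recipe i uses ingredient j.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(recipes["ingredients"])
y = recipes["cuisine"].to_numpy()

# Step 3: 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 4: fit the four classifier families and record test accuracy.
models = {
    "Perceptron": Perceptron(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
```

On the real data the same calls would produce the 39774 x 6714 matrix; a sparse representation (`MultiLabelBinarizer(sparse_output=True)`) may be worth considering at that size.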
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/GeoChart(Country).html', width=970, height=650)
img = 'https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/MainWorkflow.png?raw=true'
Image(url=img)
1. Sample imbalance across cuisines matters for statistical estimation and can bias the classifier toward the majority classes
2. The matrix is large and slow to process (training the ANN on the entire dataset took over 2 hours)
How can we reduce the sample imbalance problem?
Take a Two-step Approach:
What if we classify first by the continents and then classify the country?
We assume that the two-step approach will not introduce classification bias, since food varies by region (food and region are highly correlated)
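A minimal sketch of the two-step idea, assuming logistic regression at both levels and a made-up four-cuisine toy problem. The `CONTINENT_OF` mapping, the helper names `fit_two_step` and `predict_two_step`, and the toy ingredients are all hypothetical and only illustrate the routing logic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cuisine -> continent mapping for a four-cuisine toy problem.
CONTINENT_OF = {"italian": "Europe", "french": "Europe",
                "chinese": "Asia", "japanese": "Asia"}


def fit_two_step(X, y_country):
    """Fit one continent classifier and one country classifier per continent."""
    y_continent = np.array([CONTINENT_OF[c] for c in y_country])
    continent_clf = LogisticRegression(max_iter=1000).fit(X, y_continent)
    country_clfs = {}
    for continent in np.unique(y_continent):
        mask = y_continent == continent
        country_clfs[continent] = LogisticRegression(max_iter=1000).fit(
            X[mask], y_country[mask]
        )
    return continent_clf, country_clfs


def predict_two_step(continent_clf, country_clfs, X):
    """Predict the continent first, then route each row to that continent's model."""
    continents = continent_clf.predict(X)
    countries = np.empty(len(X), dtype=object)
    for continent, clf in country_clfs.items():
        mask = continents == continent
        if mask.any():
            countries[mask] = clf.predict(X[mask])
    return countries


# Toy binary matrix; columns: wheat, soy sauce, tomato, cream, ginger, seaweed.
prototypes = {"italian": [1, 0, 1, 0, 0, 0], "french": [1, 0, 0, 1, 0, 0],
              "chinese": [0, 1, 0, 0, 1, 0], "japanese": [0, 1, 0, 0, 0, 1]}
y = np.array([c for c in prototypes for _ in range(5)])
X = np.array([prototypes[c] for c in y])

continent_clf, country_clfs = fit_two_step(X, y)
pred = predict_two_step(continent_clf, country_clfs, X)
```

Note that an error at the continent level cannot be recovered at the country level, and a continent containing a single cuisine would need special handling because the per-continent classifier requires at least two classes.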
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Dendogram(country).PNG?raw=true')
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Dendogram(continent).PNG?raw=true')
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/GeoChart(Continent).html', width=970, height=650)
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/PieChart(Country).html', width=800, height=400)
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/PieChart(Continent).html', width=800, height=400)
We tried setting class_weight to 'balanced' (putting less weight on majority-class instances)
$\;\;$→ This results in slightly (~0.1%) higher training accuracy but lower testing accuracy than plain logistic regression
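For reference, scikit-learn's `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`. The sketch below shows the resulting weights on a hypothetical 8-versus-2 label vector and how the option is passed to `LogisticRegression`; the labels and the random feature matrix are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 majority-class and 2 minority-class recipes.
y = np.array(["italian"] * 8 + ["thai"] * 2)
classes = np.unique(y)

# 'balanced' uses n_samples / (n_classes * count_per_class):
#   italian: 10 / (2 * 8) = 0.625, thai: 10 / (2 * 2) = 2.5
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))

# Random binary features, only to show the two fitting calls side by side.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(len(y), 6))

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

The majority class receives a weight below 1 and the minority class a weight above 1, which matches the description above.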
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Logistic(Weight_balanced_compare).png?raw=true')
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Workflow3.png?raw=true')
For example, our model should classify spaghetti as Italian
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Workflow4.png?raw=true')
$\;\;\;\;\;\;$1. Use the ROC curves to determine the best model at each level
$\;\;\;\;\;\;$2. Combine the best models from the continent level and the country level
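One way to turn "best model by ROC curve" into a number is the macro-averaged one-vs-rest AUC. The sketch below is an assumption about how that selection could be done, not the project's code: it uses synthetic data from `make_classification`, restricts itself to the three linear models because they all expose `decision_function`, and the helper name `macro_auc` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC

# Synthetic three-class problem standing in for one classification level.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)


def macro_auc(model):
    """Fit the model and return its macro-averaged one-vs-rest ROC AUC."""
    model.fit(X_train, y_train)
    scores = model.decision_function(X_test)          # shape (n_samples, n_classes)
    y_bin = label_binarize(y_test, classes=model.classes_)
    return roc_auc_score(y_bin, scores, average="macro")


candidates = {
    "Perceptron": Perceptron(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(random_state=0),
}
aucs = {name: macro_auc(model) for name, model in candidates.items()}
best = max(aucs, key=aucs.get)
```

The same procedure would be run once for the continent level and once per continent for the country level, keeping the top-scoring model each time.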
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ROCCurvesContinent.png?raw=true')
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ROCCurvesCountries.png?raw=true')
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Accuracycompare1.png?raw=true')
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Accuracycompare2.png?raw=true')
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ComparisonBetweenSVMand2Step_resize.png?raw=true')
Testing accuracy with SVM: 0.791
Testing accuracy with the two-step approach: 0.847
(1) The two-step approach has better overall test accuracy
(2) Based on the confusion matrix plot, however, we cannot conclude that the two-step approach is better in predicting every cuisine
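The gap between overall accuracy and per-cuisine performance can be illustrated with per-class recall read off a confusion matrix. The predictions below are made up for two cuisines and are not the project's results; they only show that a model can win overall while losing on one class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["italian", "thai"]
y_true = ["italian"] * 4 + ["thai"] * 4

# Made-up predictions from two hypothetical models.
pred_single = ["italian"] * 4 + ["thai", "italian", "italian", "italian"]
pred_two_step = ["italian", "italian", "italian", "thai"] + ["thai"] * 4


def per_class_recall(y_true, y_pred):
    """Recall per class: diagonal of the confusion matrix over its row sums."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, cols: predicted
    return cm.diagonal() / cm.sum(axis=1)


acc_single = accuracy_score(y_true, pred_single)        # 5 of 8 correct
acc_two_step = accuracy_score(y_true, pred_two_step)    # 7 of 8 correct
recall_single = per_class_recall(y_true, pred_single)
recall_two_step = per_class_recall(y_true, pred_two_step)
```

Here the second model has the higher accuracy (0.875 vs 0.625) but a lower recall on "italian" (0.75 vs 1.0).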
Discovering the relationships between cuisines across regions and cultures
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/ConfusionMatrix.html', width=900, height=900)
1. America section
2. Asia section
3. Europe section
4. Africa section
No: the Best + Second Best combination has a slightly higher testing accuracy
Possible reason: logistic regression and SVM differ very little in AUC (less than 0.07%) in the ROC plot, and we used macro-averaging for multi-class classification
$\;\;\;\;\;\;$ → The binarized classes have different sample counts, so the average should be weighted to correctly account for the sample imbalance
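The difference between macro and support-weighted averaging can be sketched with `roc_auc_score`. The labels and scores below are synthetic (a 60/30/10 class split with noisy made-up scores), so only the relationship between the averages is meaningful, not the values themselves.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
classes = ["italian", "mexican", "thai"]

# Imbalanced synthetic labels: 60 / 30 / 10 samples per class.
y_true = np.array(["italian"] * 60 + ["mexican"] * 30 + ["thai"] * 10)
y_bin = label_binarize(y_true, classes=classes)          # shape (100, 3)

# Made-up scores: the true class gets a boost of 1.0 plus Gaussian noise.
scores = y_bin * 1.0 + rng.normal(scale=1.0, size=y_bin.shape)

per_class = roc_auc_score(y_bin, scores, average=None)       # one AUC per class
macro = roc_auc_score(y_bin, scores, average="macro")        # plain mean
weighted = roc_auc_score(y_bin, scores, average="weighted")  # weighted by support
support = y_bin.sum(axis=0)                                  # true samples per class
```

Macro-averaging gives the 10-sample class the same influence as the 60-sample class, while the weighted average scales each class AUC by its support.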
We ran a Multiple Correspondence Analysis (MCA) on the entire dataset
1. Half of the principal components are needed to explain 99% of the variance of the binary data (X)
2. Using these reduced features, we fit a logistic regression on the entire dataset
$\;\;\;\;$ → The test accuracy was 0.73, which is lower than that of the two-step approach
$\;\;\;\;$ → For the large binary recipe dataset, feature extraction does not reduce the feature space efficiently
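To show how a "components needed for 99%" figure can be obtained, here is a hand-rolled sketch that treats MCA as correspondence analysis of the indicator matrix `[X, 1 - X]`. It is an assumption about the procedure, not the project's implementation: the helper name `mca_cumulative_inertia` is hypothetical and the data are random, so the resulting count says nothing about the recipe dataset.

```python
import numpy as np


def mca_cumulative_inertia(X):
    """Cumulative share of inertia from CA of the indicator matrix [X, 1 - X]."""
    X = np.asarray(X, dtype=float)
    Z = np.hstack([X, 1.0 - X])        # 'used' and 'not used' column per ingredient
    Z = Z[:, Z.sum(axis=0) > 0]        # drop empty categories (avoids division by zero)
    P = Z / Z.sum()                    # correspondence matrix
    r = P.sum(axis=1, keepdims=True)   # row masses
    c = P.sum(axis=0, keepdims=True)   # column masses
    S = (P - r @ c) / np.sqrt(r @ c)   # standardized residuals
    singular_values = np.linalg.svd(S, compute_uv=False)
    inertia = singular_values ** 2     # principal inertias, largest first
    return np.cumsum(inertia) / inertia.sum()


# Random sparse binary matrix standing in for the recipe data.
rng = np.random.default_rng(0)
X = (rng.random((200, 30)) < 0.2).astype(int)

cum = mca_cumulative_inertia(X)
k99 = int(np.searchsorted(cum, 0.99) + 1)   # components needed to reach 99%
```

When the binary columns are close to independent, as in this random matrix, the inertia is spread across many components, which is the kind of situation in which dimensionality reduction offers little compression.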
1. We achieved a testing accuracy of 0.847 in classifying the cuisines with the two-step classifiers
2. We believe this project can benefit people who love food and want to learn about the culture behind different recipes
3. We hope our model can serve as a useful resource for people interested in how cuisines relate to lifestyle across regions and cultures
Data Source:
https://www.kaggle.com/kaggle/recipe-ingredients-dataset#test.json
Publication:
Others:
https://www.researchgate.net/post/Machine_learning_if_proportion_of_number_of_cases_in_different_class_in_training_set_matters
https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/chawla2002.html
http://www.bigendiandata.com/2017-06-27-Mapping_in_Jupyter/
https://towardsdatascience.com/a-complete-guide-to-an-interactive-geographical-map-using-python-f4c5197e23e0
https://bokeh.pydata.org/en/latest/docs/gallery/unemployment.html
http://www.freeworldmaps.net