End-to-End Recipe Cuisine Classification

Will the ML do better than 76% (a Dummy classifier where every recipe is classified as Italian)?

Multinomial Naive Bayes

Multinomial Naive Bayes, which is simple and fast, got almost "perfect" f1_weighted scores for both cuisines on the training and testing sets.
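As a minimal sketch of how such a pipeline could be scored (variable names here are hypothetical; the actual code is in the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# docs: one ingredient string per recipe; labels: the two cuisines (hypothetical)
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

scores = cross_validate(
    pipeline, docs, labels,
    cv=StratifiedKFold(n_splits=3),
    scoring="f1_weighted",
    return_train_score=True,  # report train scores alongside test scores
)
print(scores)  # dict with fit_time, score_time, test_score, train_score arrays
```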

The scores were consistent with stratified 3-fold cross-validation:

{'fit_time': array([0.03857684, 0.03397703, 0.03488088]),
 'score_time': array([0.01746416, 0.01721191, 0.01700783]),
 'test_score': array([0.98459752, 0.98551861, 0.98093119]),
 'train_score': array([0.98549045, 0.98641227, 0.9873263 ])}

The Multiclass case

The work is in this Jupyter Notebook.

The 6 classes are Chinese, French, Indian, Italian, Mexican, and Thai.

The final results are in the summary table below.

Related Work

Some of the related work I found on recipe cuisine classification based on parsing ingredient text had similar findings to my work.

None of them had a balanced class dataset.

This team had the best results with logistic regression due to their large training set.

They found that upsampling the classes with small instance sizes did not yield better results.

This team used random forest and had better scores than logistic regression on the training set, but overfit, as the test results were worse.

They had an accuracy score of 77.87% on 39,774 recipes.

Neither team reports individual scores for each cuisine, only the overall accuracy across all cuisines.

Initial Analysis

I found that using bigrams gave the overall score some lift, and that a count vectorizer had better results than TF-IDF.

This was expected based on the related work results and my domain knowledge of some of the cuisines.

(Page 4 of this paper explains why TF-IDF was worse than a count vectorizer for a recipe classification problem.) Due to these results, I changed my parse_recipes function to include helpful two-word ingredients in the document string and went with bag-of-words.
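A minimal sketch of that change (the real parse_recipes is in the notebook; the phrase list here is hypothetical):

```python
# Hypothetical list of two-word ingredients worth keeping as single tokens
TWO_WORD_INGREDIENTS = {"fish sauce", "olive oil", "soy sauce", "green olive"}

def parse_recipes(ingredients):
    """Join a recipe's ingredients into one document string,
    fusing known two-word ingredients so bag-of-words keeps them."""
    doc = " ".join(ingredients).lower()
    for phrase in TWO_WORD_INGREDIENTS:
        doc = doc.replace(phrase, phrase.replace(" ", "_"))
    return doc

print(parse_recipes(["Fish sauce", "lime", "Thai basil"]))
# -> "fish_sauce lime thai basil"
```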

Random forest overfits the data, and Logistic Regression has better results than Multinomial Naive Bayes due to the increased amount of data.

I also checked the F1 scores for each model and looked at the confusion matrices.

The results are below; selected models and the code are left in the Jupyter notebook.
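For example, the per-cuisine F1 scores and the confusion matrix can be inspected like this (assuming a fitted model and a held-out test split):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)  # model: any fitted classifier from above

# Per-cuisine precision, recall, and F1 rather than one overall number
print(classification_report(y_test, y_pred))

# Rows: true cuisines; columns: predicted cuisines
print(confusion_matrix(y_test, y_pred, labels=sorted(set(y_test))))
```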

results for baseline (most frequent) classifier:

{'fit_time': array([0.14675808, 0.13814402, 0.13909221]),
 'score_time': array([0.06582284, 0.05454063, 0.06289887]),
 'test_score': array([0.45315067, 0.45420838, 0.45473865]),
 'train_score': array([0.4544734 , 0.4539436 , 0.45367905])}

results for MNB count vector:

{'fit_time': array([0.13690996, 0.13612795, 0.14776587]),
 'score_time': array([0.05504894, 0.05874705, 0.05910468]),
 'test_score': array([0.83512138, 0.85497245, 0.82022192]),
 'train_score': array([0.92489217, 0.92269654, 0.9312881 ])}

results for LR count vector:

{'fit_time': array([1.87117004, 1.77050972, 1.72172022]),
 'score_time': array([0.05725479, 0.05667615, 0.06186581]),
 'test_score': array([0.86138298, 0.87050159, 0.86625949]),
 'train_score': array([0.99856009, 0.99568578, 0.99532393])}

results for LR tfidf vectorizer:

{'fit_time': array([1.53208613, 1.15100121, 1.14650798]),
 'score_time': array([0.08225894, 0.06687689, 0.06154084]),
 'test_score': array([0.82739293, 0.84396535, 0.82896392]),
 'train_score': array([0.90109714, 0.89225469, 0.90134512])}

results for RF count vector:

{'fit_time': array([2.1868701 , 2.394063  , 2.15287924]),
 'score_time': array([0.13759899, 0.13965178, 0.13798285]),
 'test_score': array([0.80341796, 0.81544372, 0.81283775]),
 'train_score': array([0.99856015, 0.996409  , 0.99712821])}

Logistic Regression

As the data size grows, Logistic Regression performs better than Naive Bayes.

This is also true for the recipe dataset.

Looking at the individual scores for each cuisine, there is overfitting, and it is not clear if class imbalance is a problem.

For example, there are 75 total Mexican recipes with an F1 score of 0.73, while there are 387 French recipes with an F1 score of 0.63.

Logistic Regression test/train confusion matrix heat maps

Some observations:
- French gets misclassified most as Italian.
- Mexican gets misclassified most as Italian.
- Thai gets misclassified most as Chinese, followed by Indian.

Mexican as Italian is slightly unexpected, but I can see why from some of the printed cases. One example where a Mexican recipe was classified as Italian:

plum oregano fish bouillon banana garlic tomato bouillon butter plum tomato onion olive oil black pepper caper olive bay salt oil pepper white fish pickled green olive cinnamon

The scores for the training set are higher than the test scores, and I tried to deal with this overfitting and the possible contribution of class imbalance.

Class weights in loss function

The class_weight='balanced' argument in the scikit-learn LogisticRegression class weighs classes inversely proportional to their frequency.
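A minimal sketch of the weighted model (max_iter is illustrative):

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' weighs each class by n_samples / (n_classes * class_count),
# i.e. inversely proportional to its frequency in the training labels
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```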

There was some improvement with the classes that did not do as well.

Logistic Regression (class weighted loss) test/train confusion matrix heat maps

Oversampling and Undersampling

I used a Python package called imbalanced-learn to implement over- and undersampling, and to find out how much this would or would not help the scores of the cuisines.
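A sketch of both strategies with imbalanced-learn's random resamplers (the package also offers SMOTE and others):

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversample minority cuisines up to the size of the largest class...
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# ...or undersample majority cuisines down to the size of the smallest class
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
```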

Logistic Regression (oversampling) test/train confusion matrix heat maps

Logistic Regression (undersampling) test/train confusion matrix heat maps

By comparison, oversampling did not make much difference except for reducing the Mexican and French scores.

Undersampling had the worst performance.

Chi-Squared (χ²)

Finally, using the χ² test on the ingredient features and selecting the top K most significant ones is another method to deal with overfitting on the training set.
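A sketch of the selection step (k=600, matching the heat maps below):

```python
from sklearn.feature_selection import SelectKBest, chi2

# Keep the 600 ingredient features most dependent on the cuisine label
selector = SelectKBest(chi2, k=600)
X_train_chi2 = selector.fit_transform(X_train, y_train)
X_test_chi2 = selector.transform(X_test)

# Per-feature chi2 scores, usable for the ingredient ranking shown below
chi2_scores = selector.scores_
```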

The training scores went down slightly, but the test scores did not improve compared to the other models, like the class-weighted loss function Logistic Regression.

Logistic Regression (χ² 600 best) test/train confusion matrix heat maps

These are the top 10 ingredient words sorted by χ² score, with class frequency:

fish: 1897.8449503573233 [('thai', 48), ('chinese', 8), ('indian', 6), ('french', 5), ('italian', 4), ('mexican', 1)]
cumin: 1575.1841574113994 [('indian', 275), ('italian', 16), ('mexican', 9), ('thai', 6), ('chinese', 1), ('french', 1)]
husk: 1318.6409852859006 [('mexican', 12), ('indian', 3)]
masala: 1236.5322033898303 [('indian', 146)]
cheese: 1155.8931811338373 [('italian', 1074), ('french', 85), ('indian', 22), ('mexican', 9), ('thai', 3), ('chinese', 2)]
fish sauce: 1119.2711018711018 [('thai', 45), ('french', 2), ('chinese', 2)]
turmeric: 999.0619453767151 [('indian', 225), ('thai', 8), ('italian', 1)]
peanut: 994.6422201344235 [('thai', 42), ('chinese', 26), ('indian', 24), ('italian', 3)]
lime: 991.2019982362463 [('thai', 43), ('indian', 33), ('mexican', 17), ('italian', 8), ('french', 4), ('chinese', 3)]
sesame: 958.9888557397035 [('chinese', 77), ('indian', 8), ('italian', 7), ('thai', 7), ('french', 1)]

The highest-scoring ingredient that appeared most frequently in French recipes did not come until #66:

gruyere: 291.2451127819549 [('french', 17), ('italian', 2)]

Deep Learning

For fun and curiosity, I also implemented two Deep Learning architectures with Keras.

Deep Learning plot models for (L) Dense network and (R) Multichannel 2-gram with embedding and Conv1D

Deep Learning with Keras — densely connected network

This model uses the tokenized words. Its French F1 score of 0.72 was the best out of all models, including the second DL implementation. The validation score wavered after 3 epochs.

Early stopping would save the state of the model at that point.
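A minimal sketch of a densely connected network of this kind with early stopping (layer sizes are illustrative, not the exact architecture from the notebook):

```python
from tensorflow import keras
from tensorflow.keras import layers

# num_words: size of the tokenized ingredient vocabulary (hypothetical)
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(num_words,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(6, activation="softmax"),  # one output per cuisine
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# restore_best_weights keeps the model state from the best epoch
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)
model.fit(X_train, y_train_onehot, validation_split=0.2,
          epochs=30, callbacks=[early_stop])
```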

Deep Learning with Keras — Multichannel 2-gram with embedding and Conv1D

I wanted to try using bigrams to see how much difference that would make with Deep Learning.

I found an implementation here that I used as a guide.
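A sketch of the multichannel idea: one embedding + Conv1D channel per n-gram size, concatenated before the classifier (dimensions are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

def ngram_channel(vocab_size, seq_len, kernel_size):
    """One channel: embedding -> Conv1D -> global max pooling."""
    inp = keras.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, 50)(inp)
    x = layers.Conv1D(32, kernel_size, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return inp, x

in1, out1 = ngram_channel(vocab_size, seq_len, kernel_size=1)  # unigram channel
in2, out2 = ngram_channel(vocab_size, seq_len, kernel_size=2)  # bigram channel

merged = layers.concatenate([out1, out2])
output = layers.Dense(6, activation="softmax")(merged)
model = keras.Model(inputs=[in1, in2], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```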

I didn't think an RNN would do much, just as with word2vec, because the order of ingredients is irrelevant to the cuisine.

Deep Learning Multichannel 2-gram/embedding/Conv1D test/train confusion matrix heat maps

As I expected, using 3-grams didn't give much lift vs. using only 1- and 2-grams.

This architecture did not do as well, overall, as the simpler DL densely connected network architecture.

Summary of Results

Conclusion summary:
- Class Weighted Logistic Regression had top scores for Chinese, French, and Indian.
- Italian did best with the Conv1D and multigram embedding DL model, followed by Class Weighted Logistic Regression.
- Mexican did best with oversampling, followed by Class Weighted Logistic Regression.

The model that I will go with for the AWS deployment is Logistic Regression with the class-weighted loss function, the simplest and best overall.

The decision for a business can be based on how accurate a classifier you need, how quickly it trains and predicts, how much data you have, and how it needs to work in production.

Another idea, in addition to collecting more data, is considering the quantity of an ingredient as part of the feature weight.

Experiment: Would more data help with overfitting, classes that do not score high, and class imbalance?

I tried an experiment to test whether getting more data to balance the classes could make a difference.

This can be an important metric to quantify if getting more data is time consuming or expensive.

It was not clear based on the results I had at the beginning whether class imbalance contributes to low test scores.

I plotted the test and training scores for each cuisine.

The class sizes start at 25 and continue in increments of 25.

If the sample size is less than a cuisine's total count, that cuisine is downsampled so all classes are equal; this holds until the sample size goes over the cuisine's count.

Vertical colored lines mark the max class count on the x axis.
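A sketch of the experiment loop (hypothetical helper and column names; the real code is in the notebook):

```python
import pandas as pd

max_count = df["cuisine"].value_counts().max()
scores_by_size = {}
for size in range(25, max_count + 25, 25):
    # Downsample each cuisine to at most `size` recipes, so classes stay
    # equal until a cuisine's total count is exceeded
    sample = (df.groupby("cuisine", group_keys=False)
                .apply(lambda g: g.sample(min(len(g), size), random_state=42)))
    scores_by_size[size] = fit_and_score_per_cuisine(sample)  # hypothetical helper
```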

Thai, Mexican, and Indian do best with a class count of 50. Thai and Mexican fluctuate at a lower range, and Mexican dips down towards the high end of the count. Indian starts to go up as the class count increases.

French does the best at 50 and 75, and then starts to go downwards.

Italian is not what I expected; it should be in the higher range of 0.9, and I could have a bug in the code below.

Though this experiment didn't conclusively show whether collecting more data would help, the results of all of the other attempts in this notebook suggest that collecting more data is the last option left.

