Question

I'm interested in classifying recipes programmatically based on a statistical analysis of various properties of the recipe. In other words, I want to classify a recipe as Breakfast, Lunch, Dinner or Dessert without any user input.

The properties I have available are:

  1. The recipe title (such as chicken salad)
  2. The recipe description (arbitrary text describing the recipe)
  3. The cooking method (the steps involved in preparing this recipe)
  4. Prep and cook times
  5. Each ingredient in the recipe, and its amount

The good news is I have a sample set of about 10,000 recipes that are already classified, and I can use these data to teach my algorithm. My idea is to look for patterns, such as if the word syrup appears statistically more frequently in breakfast recipes, or any recipe that calls for over 1 cup of sugar is 90% likely to be a dessert. I figure if I analyze the recipe across several dimensions, and then tweak the weights as appropriate, I can get something that's decently accurate.

What would be some good algorithms to investigate while approaching this problem? Would something like k-NN be helpful, or are there ones better suited to this task?

Solution

Try various well-known machine learning algorithms. I would suggest starting with a Bayesian classifier (such as naive Bayes), since it is easy to implement and often works fairly well. If that does not work, try something more complex, e.g. neural networks or SVMs.

The main problem will be deciding on a set of features to feed into your method. For this you should look at which information is unique. For example, if you have a recipe titled "Chicken Salad", the "chicken" part will not be of much interest, because it is also present in the ingredients and simpler to gather from there. So you should try to find a set of keywords that give new information (i.e. the "Salad" part). This can probably be automated somehow, but more likely you will be better off doing it by hand, since it only needs to be done once.
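
As a rough illustration of the "Chicken Salad" point, here is a minimal sketch that keeps only the title words not already covered by the ingredient list. The function names `tokenize` and `title_keywords` and the sample ingredient lines are made up for this example, and the naive word matching (no stemming, so "eggs" and "egg" differ) is an assumption you would likely want to refine.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def title_keywords(title, ingredients):
    """Return the title words that do not appear in any ingredient line."""
    ingredient_words = set()
    for line in ingredients:
        ingredient_words.update(tokenize(line))
    return [word for word in tokenize(title) if word not in ingredient_words]


# Hypothetical recipe: "chicken" is redundant with the ingredients, "salad" is new.
print(title_keywords("Chicken Salad", ["2 cups cooked chicken", "1/2 cup mayonnaise"]))
# prints ['salad']
```

The surviving words are only candidates; you would still prune them by hand as described above.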

The same goes for the description. Finding the correct set of features is always the hardest part for such a task.

Once you have your set of features, just train your algorithm on them and see how well it does. If you do not have much experience with machine learning, have a look at the different methods for correctly testing an ML algorithm (e.g. leave-N-out cross-validation).
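
To make the testing advice concrete, here is a minimal sketch of k-fold cross-validation, one common form of leave-N-out testing. Everything in it is illustrative: `train_fn` and `predict_fn` stand for whatever classifier you end up with, and the majority-class baseline is only there so the example runs. It assumes `k` is no larger than the number of samples.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_fn, predict_fn, k=5):
    """Return the mean accuracy on the held-out fold, averaged over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held_out_set]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Hypothetical baseline: always predict the most common training label.
def train_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


recipes = [["eggs"], ["toast"], ["steak"], ["cake"], ["pasta"], ["soup"]]
meals = ["Breakfast", "Breakfast", "Dinner", "Dessert", "Dinner", "Dinner"]
print(cross_validate(recipes, meals, train_majority, predict_majority, k=3))
```

Any real classifier should beat the majority baseline on held-out data; if it does not, the features are probably the problem.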

OTHER TIPS

If I were to do it, I would do it as suggested by LiKao. I would first focus on the ingredients. I would build a dictionary of the words appearing in the Ingredients sections of the recipes, and clean up the list in a supervised way to remove non-ingredient terms such as quantities and units.
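
A minimal sketch of that dictionary-building step might look like the following. The `STOP_WORDS` set is a hypothetical starting point that you would extend by hand, and the assumption that each recipe is a list of free-text ingredient lines is mine, not something stated in the question.

```python
import re
from collections import Counter

# Hypothetical hand-curated list of units and filler words to discard.
STOP_WORDS = {
    "cup", "cups", "tbsp", "tsp", "tablespoon", "tablespoons",
    "teaspoon", "teaspoons", "oz", "ounce", "ounces", "lb", "lbs",
    "pound", "pounds", "g", "kg", "ml", "l", "pinch", "of",
}


def ingredient_terms(line):
    """Extract candidate ingredient words from one ingredient line.

    Digits and fractions are dropped because only alphabetic runs are matched.
    """
    words = re.findall(r"[a-z]+", line.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_vocabulary(recipes, min_count=1):
    """Collect terms that occur in at least min_count recipes."""
    counts = Counter()
    for ingredient_lines in recipes:
        terms = set()
        for line in ingredient_lines:
            terms.update(ingredient_terms(line))
        counts.update(terms)  # count each term once per recipe
    return {term for term, count in counts.items() if count >= min_count}


print(ingredient_terms("2 cups of sugar"))  # prints ['sugar']
```

Raising `min_count` is a cheap way to drop rare words and typos before reviewing the list manually.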

Then I would resort to Bayes' theorem: your database allows you to compute the probability of having Eggs in a Breakfast, in a Dinner, and so on; you can precompute these conditional probabilities, together with the prior probability of each meal type. Then, given an unknown recipe containing both Eggs and Marmalade, you can compute the a posteriori probability of the meal being a Breakfast.
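
Here is a small sketch of that computation as a naive Bayes classifier. It makes two simplifying assumptions that are mine rather than the answer's: ingredients are treated as independent given the meal type, and only the ingredients present in the recipe contribute to the score. Add-one (Laplace) smoothing is used so that an unseen ingredient does not force a probability to zero. The toy training data is invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(recipes, labels):
    """Count how many recipes of each class contain each term."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for terms, label in zip(recipes, labels):
        unique_terms = set(terms)
        term_counts[label].update(unique_terms)
        vocabulary |= unique_terms
    return class_counts, term_counts, vocabulary


def posterior(model, terms):
    """Return P(class | terms) for every class, using Bayes' theorem."""
    class_counts, term_counts, vocabulary = model
    total = sum(class_counts.values())
    present = set(terms) & vocabulary  # ignore words never seen in training
    log_scores = {}
    for label, n_recipes in class_counts.items():
        score = math.log(n_recipes / total)  # prior P(class)
        for term in present:
            # Smoothed estimate of P(term | class)
            score += math.log((term_counts[label][term] + 1) / (n_recipes + 2))
        log_scores[label] = score
    # Normalise in a numerically stable way so the posteriors sum to 1.
    top = max(log_scores.values())
    unnormalised = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {label: value / z for label, value in unnormalised.items()}


# Invented toy data set.
recipes = [["eggs", "marmalade"], ["eggs", "bacon"], ["steak", "potato"], ["eggs", "steak"]]
labels = ["Breakfast", "Breakfast", "Dinner", "Dinner"]
model = train_naive_bayes(recipes, labels)
print(posterior(model, ["eggs", "marmalade"]))  # Breakfast comes out at about 0.75
```

With 10,000 labelled recipes the same counts can be computed once and stored, so classifying a new recipe is just a handful of lookups.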

You can later enrich this with other terms and/or take quantities into account (number of Eggs per person)...

I think a neural network is probably overkill for this. I would try classifying with a single-perceptron "network" for each type of meal (Breakfast, Dinner, and so on), letting it go over the input and adjust its weight vector. Every meaningful word found in the dataset can be an input to the network. I would expect that to be enough for your needs. I have used this method successfully to classify text before.
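
For what it is worth, a one-perceptron-per-meal setup could be sketched roughly like this. The function names, the fixed number of epochs, and the tiny data set are all assumptions made for the example; each word is a binary input, and the meal whose perceptron gives the highest activation wins.

```python
def train_perceptron(samples, targets, vocabulary, epochs=10):
    """Train one perceptron; targets are +1 (this meal) or -1 (any other)."""
    weights = {word: 0.0 for word in vocabulary}
    bias = 0.0
    for _ in range(epochs):
        for words, target in zip(samples, targets):
            active = [w for w in set(words) if w in weights]
            activation = bias + sum(weights[w] for w in active)
            predicted = 1 if activation > 0 else -1
            if predicted != target:
                # Classic perceptron rule: nudge weights toward the target.
                for w in active:
                    weights[w] += target
                bias += target
    return weights, bias


def train_one_vs_rest(samples, labels, epochs=10):
    """Train a separate perceptron for each meal type."""
    vocabulary = {word for words in samples for word in words}
    return {
        label: train_perceptron(
            samples, [1 if l == label else -1 for l in labels], vocabulary, epochs
        )
        for label in set(labels)
    }


def classify(models, words):
    """Pick the meal type whose perceptron has the highest activation."""
    def score(model):
        weights, bias = model
        return bias + sum(weights.get(w, 0.0) for w in set(words))
    return max(models, key=lambda label: score(models[label]))


# Invented toy data set.
samples = [["eggs", "toast"], ["pancake", "syrup"], ["steak", "potato"], ["pasta", "sauce"]]
labels = ["Breakfast", "Breakfast", "Dinner", "Dinner"]
models = train_one_vs_rest(samples, labels)
print(classify(models, ["syrup", "toast"]))  # prints Breakfast
```

A perceptron only converges when the classes are linearly separable in the word features, so on real data you would cap the epochs as above and check accuracy on held-out recipes.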

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow