Mining Structured Data From Recipes

Entity extraction, Natural language processing, Machine learning, K-Means clustering, Stanford CoreNLP, Conditional random fields

Challenge

A food-tech startup based in India needed to implement food recommendations for their users, but had too little information about their users’ preferences. A large database of user-generated English-language recipes could provide a wealth of previously unavailable data, provided it was extracted automatically, accurately, and at sufficient scale.

Solution

We used machine learning tools to build an API that takes raw, plain-text recipes and extracts structured data: ingredients, quantities and units, step-by-step instructions, cooking temperatures, and required appliances and utensils.

Technologies

  • To separate the ingredient list from the instructions, we used K-Means clustering from the scikit-learn library. Each sentence or line was represented as a “bag of POS tags” that models its grammatical profile, combined with extra features such as “starts with number”, “starts with verb”, and “ends with period”.
  • To extract ingredients, quantities, and units, we used the New York Times Ingredient Phrase Tagger, a specialized conditional random fields (CRF) model trained on 130,000 documents from the NYT Cooking section, plus our own dataset featuring European units such as grams.
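
The line-separation idea can be illustrated with a minimal sketch. It is not the production code: the project used scikit-learn's K-Means and a real POS tagger, while this version swaps in a tiny hand-rolled k-means and a toy verb lexicon (`TOY_VERBS`, `toy_pos`, and `featurize` are hypothetical names) so that it runs without dependencies.

```python
# Hypothetical toy verb lexicon standing in for a real POS tagger.
TOY_VERBS = {"preheat", "set", "bake", "remove", "line", "cool", "allow",
             "combine", "whisk", "add", "grease"}

def toy_pos(token):
    """Rough tag (NUM, VERB or OTHER); an assumption replacing a real tagger."""
    word = token.strip(".,;:!?").lower()
    if word[:1].isdigit():
        return "NUM"
    if word in TOY_VERBS:
        return "VERB"
    return "OTHER"

def featurize(line):
    """Three boolean cues plus a small 'bag of POS tags' (tag frequencies)."""
    tags = [toy_pos(t) for t in line.split()]
    n = len(tags) or 1
    return [
        1.0 if tags and tags[0] == "NUM" else 0.0,   # starts with number
        1.0 if tags and tags[0] == "VERB" else 0.0,  # starts with verb
        1.0 if line.rstrip().endswith(".") else 0.0, # ends with period
        tags.count("NUM") / n,
        tags.count("VERB") / n,
    ]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda k: sq_dist(p, centroids[k])) for p in points]
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

LINES = [
    "3 tablespoons olive oil",
    "2 garlic cloves, finely chopped",
    "1 cup cottage cheese",
    "12 eggs",
    "Preheat the oven to 355 F and line a baking sheet with baking paper.",
    "Set aside to cool for 5 minutes.",
    "Bake in the oven for 8-10 minutes or until egg is cooked.",
    "Remove and allow to cool for 5 minutes.",
]

X = [featurize(line) for line in LINES]
# Deterministic seeding from the first and last line (a simplification).
labels = kmeans(X, [X[0], X[-1]])
for label, line in zip(labels, LINES):
    print(label, line)
```

On these sample lines, the four ingredient lines fall into one cluster and the four instruction sentences into the other. Clustering is unsupervised, so which cluster means "ingredients" still has to be decided afterwards, for example by checking which one contains more lines that start with a number.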

Instructions are parsed using the Stanford CoreNLP pipeline, combining multiple approaches:

  • A CRF NER model trained on the CoNLL-2003 dataset.
  • Custom regex rules (e.g. for extracting temperature-related data).
  • Domain-specific gazetteers for various new entity types such as utensils and appliances. These can be extended at any time without retraining any model.
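
The rule-based and gazetteer-based parts of such a pipeline can be illustrated in a few lines. The sketch below is plain Python rather than Stanford CoreNLP, and the pattern `TEMPERATURE_RE`, the `GAZETTEERS` contents, and the `UTENSIL` label are hypothetical stand-ins chosen for illustration, not the rules or lists actually used in the project.

```python
import re

# Hypothetical rule: a 2-3 digit number, an optional degree sign, then F or C.
TEMPERATURE_RE = re.compile(r"\b(\d{2,3})\s*°?\s*([FC])\b")

# Hypothetical gazetteers: plain term lists keyed by entity type.
# Adding a term changes the output immediately; no model is retrained.
GAZETTEERS = {
    "APPLIANCE": {"oven", "frypan"},
    "UTENSIL": {"bowl", "baking sheet", "muffin tray"},
}

def extract_entities(sentence):
    """Return regex-matched temperatures followed by gazetteer matches."""
    entities = []
    for match in TEMPERATURE_RE.finditer(sentence):
        entities.append({"type": "TEMPERATURE", "text": match.group(0)})
    lowered = sentence.lower()
    for entity_type, terms in GAZETTEERS.items():
        for term in sorted(terms):
            # Whole-word match so that "bowl" does not fire inside "bowling".
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                entities.append({"type": entity_type, "text": term})
    return entities

sentence = "Preheat the oven to 355 F and line a baking sheet with baking paper."
for entity in extract_entities(sentence):
    print(entity)
```

Extending coverage is then a data change only, for instance `GAZETTEERS["UTENSIL"].add("whisk")`, which mirrors the appeal of gazetteers described above.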

System Architecture

[Figure: system architecture diagram]

Example Input

BREAKFAST MUFFINS TO-GO

3 tablespoons olive oil  
2 garlic cloves, finely chopped  
1 red onion, finely chopped  
1 zucchini, grated and squeezed to remove excess moisture  
1⁄4 cup chives, finely chopped  
1 cup cottage cheese  
1⁄2 cup gluten-free flour  
1 teaspoon baking powder  
1 pinch salt and pepper  
12 eggs  
1 avocado, cut into quarters and fanned  
kale chips  
2 tablespoons honey  
1⁄2lemon, juice and zest  
2 tablespoons sriracha sauce

Preheat the oven to 355 F and line a baking sheet with baking paper. In a bowl combine 2 Tbsp of Olive oil and the kale with salt and pepper, spread evenly on to tray and cook for 12-15 minutes or until crispy. Set aside to cool for 5 minutes. In a frypan on medium-high heat add remaining olive oil and caramelize the red onion for 4 minutes before adding garlic and cooking for a further 3 minutes, set aside. In a bowl whisk your eggs then add in your onion, zucchini, chives, cheese, flour, baking powder and season with salt & pepper. Grease a 6 tin muffin tray and evenly distribute egg mix into each. Bake in the oven for 8-10 minutes or until egg is cooked. Remove and allow to cool for 5 minutes. Meanwhile in a bowl combine the siracha, honey and lemon juice. To finish your muffin top with the avocado 
drizzle with the Sriracha honey and top with a kale chip.

API Response

{  
   "title":"BREAKFAST MUFFINS TO-GO",
   "ingredients":[  
      {  
         "qty":"3",
         "unit":"tablespoon",
         "name":"olive oil",
         "input":"3 tablespoons olive oil"
      },
      {  
         "qty":"2",
         "name":"garlic",
         "unit":"clove",
         "other":",",
         "comment":"finely chopped",
         "input":"2 garlic cloves, finely chopped"
      },
      {  
         "qty":"1",
         "name":"red onion",
         "other":",",
         "comment":"finely chopped",
         "input":"1 red onion, finely chopped"
      },
(...)
  "instructions":[  
      {  
         "text":"Preheat the oven to 355 F and line a baking sheet with baking paper.",
         "entities":[  
            {  
               "type":"APPLIANCE",
               "text":"oven",
               "metadata":[  
                  {  
                     "index":3,
                     "originalText":"oven",
                     "lemma":"oven",
                     "pos":"NN",
                  }
               ]
            },
            {  
               "type":"TEMPERATURE",
               "text":"355 F",
               "metadata":[  
                  {  
                     "index":5,
                     "lemma":"355",
(...)
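
A client might consume the ingredient part of such a response roughly as follows. This is a sketch under assumptions: the `Ingredient` class and `describe` helper are hypothetical, the sample contains only the three ingredient objects shown above, and fields such as `unit` and `comment` are treated as optional because they are absent from some entries.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Ingredient:
    name: str
    qty: Optional[str] = None      # kept as a string, as in the response
    unit: Optional[str] = None
    comment: Optional[str] = None

    @classmethod
    def from_dict(cls, item):
        # Use .get() for keys that some ingredient objects omit.
        return cls(
            name=item["name"],
            qty=item.get("qty"),
            unit=item.get("unit"),
            comment=item.get("comment"),
        )

def describe(ingredient):
    """Join the fields that are present into a short normalized label."""
    parts = [p for p in (ingredient.qty, ingredient.unit, ingredient.name) if p]
    return " ".join(parts)

# The first three ingredient objects from the response shown above.
SAMPLE_RESPONSE = """
{
  "title": "BREAKFAST MUFFINS TO-GO",
  "ingredients": [
    {"qty": "3", "unit": "tablespoon", "name": "olive oil",
     "input": "3 tablespoons olive oil"},
    {"qty": "2", "name": "garlic", "unit": "clove", "other": ",",
     "comment": "finely chopped", "input": "2 garlic cloves, finely chopped"},
    {"qty": "1", "name": "red onion", "other": ",",
     "comment": "finely chopped", "input": "1 red onion, finely chopped"}
  ]
}
"""

response = json.loads(SAMPLE_RESPONSE)
ingredients = [Ingredient.from_dict(item) for item in response["ingredients"]]
for ingredient in ingredients:
    print(describe(ingredient))
```

Keeping `qty` as a string avoids guessing how fractional quantities such as `1⁄4` are encoded in the parts of the response that are not shown.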
