Predicting correct match of French to English food descriptions

https://datascience.stackexchange.com/questions/72105

10-12-2020

Question

I have a training set and a test set of food description pairs (see the example below). The first item in each pair is the name of a food in French, and the second is a description of that food in English. The training set also has a trans field that is True for correct descriptions and False for wrong ones. The task is to predict the trans field in the test set, in other words to predict which food descriptions are correct and which are wrong.

import pandas as pd

dishes = [{"fr":"Agneau de lait", "eng":"Baby milk-fed lamb", "trans": True},
{"fr":"Agrume", "eng":"Blackcurrants", "trans": False},
{"fr":"Algue", "eng":"Buttermilk", "trans": False},
{"fr":"Aligot", "eng":"potatoes mashed with fresh mountain cheese", "trans": False},
{"fr":"Baba au rhum", "eng":"Star anise", "trans": True},
{"fr":"Babeurre", "eng":"seaweed", "trans": False},
{"fr":"Badiane", "eng":"Sponge cake (often soaked in rum)", "trans": False},
{"fr":"Boeuf bourguignon", "eng":"Créole curry", "trans": False},
{"fr":"Carbonade flamande", "eng":"Beef Stew", "trans": True},
{"fr":"Cari", "eng":"Beef stewed in red wine", "trans": False},
{"fr":"Cassis", "eng":"citrus", "trans": False},
{"fr":"Cassoulet", "eng":"Stew from the South-West of France", "trans": True},
{"fr":"Céleri-rave", "eng":"Celery root", "trans": True}]

df = pd.DataFrame(dishes)

    fr                  eng                                          trans
0   Agneau de lait      Baby milk-fed lamb                           True
1   Agrume              Blackcurrants                                False
2   Algue               Buttermilk                                   False
3   Aligot              potatoes mashed with fresh mountain cheese   False
4   Baba au rhum        Star anise                                   True
5   Babeurre            seaweed                                      False
6   Badiane             Sponge cake (often soaked in rum)            False
7   Boeuf bourguignon   Créole curry                                 False
8   Carbonade flamande  Beef Stew                                    True
9   Cari                Beef stewed in red wine                      False
10  Cassis              citrus                                       False
11  Cassoulet           Stew from the South-West of France           True
12  Céleri-rave         Celery root                                  True

I am thinking of solving this as a text classification problem, where the input is a concatenation of the French name embedding and the English description embedding.

Questions:

Which embeddings should I use, and how should I concatenate them?
Are there any other ideas for approaching this problem? BERT?

Update:

How about the following approach:

Translate (with BERT?) the French names to English
Use embeddings to create two vectors: v1 (the translated English vector) and v2 (the English description vector from the data set)
Compute v1 - v2
Create a new data set with two columns: v1 - v2 and trans
Train a classifier on this new data set
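
A minimal sketch of steps 2 to 4, using only the standard library. The `toy_embed` function is a hypothetical placeholder (a hashed bag of words) standing in for a real English embedding model, and the translated names in `rows` are made up for illustration; step 1 is assumed to have been done elsewhere.

```python
import hashlib
import math

DIM = 16  # toy dimensionality; real embeddings usually have hundreds of dimensions


def toy_embed(text, dim=DIM):
    """PLACEHOLDER embedding: hashed bag of words, L2-normalised.

    This only stands in for a real sentence-embedding model.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]


def difference_features(translated_fr, eng_description, embed=toy_embed):
    """Steps 2-3: embed both English texts and return v1 - v2."""
    v1 = embed(translated_fr)
    v2 = embed(eng_description)
    return [a - b for a, b in zip(v1, v2)]


def build_dataset(rows, embed=toy_embed):
    """Step 4: rows are (translated_fr, eng_description, trans) triples."""
    X = [difference_features(t, e, embed) for t, e, _ in rows]
    y = [int(label) for _, _, label in rows]
    return X, y


# Hypothetical translations of the French names (step 1 assumed done).
rows = [
    ("celery root", "Celery root", True),
    ("seaweed", "Buttermilk", False),
]
X, y = build_dataset(rows)
```

Any standard classifier could then be fitted on `X` and `y` for step 5. The raw difference is what the steps above describe; the absolute difference or a cosine similarity are common alternatives that may be worth comparing.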

Update 2:

It looks like cross-lingual classification may be the right solution for my problem:

https://github.com/facebookresearch/XLM#iv-applications-cross-lingual-text-classification-xnli

It is not yet clear from the description on the linked page where to plug in my own training data set or how to run the classifier on my test set. Please help me figure this out. Ideally, I would like to find an end-to-end example or tutorial on cross-lingual classification.

Solution

As you suspected, the best approach would be to take a massive multilingual pretrained language model and make use of the information about French and English that it has already learned. You can read about some good options here.

The basic idea is to train a new, lightweight network to make predictions based on the output of the pretrained model; it's usual to use just a single-layer feed-forward network for this “fine-tuning”. Some implementations already have this conveniently coded up for you, so check the documentation for whatever you decide to use!
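
To make the "lightweight network" idea concrete, here is a minimal pure-Python sketch of a single linear layer with a sigmoid output, trained on fixed feature vectors. The 2-d `features` are made-up stand-ins for the pretrained model's outputs, and `train_head` and `predict` are illustrative names, not part of any library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, lr=0.5, epochs=200):
    """Fit one linear layer + sigmoid on fixed (frozen) feature vectors."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            grad = p - y  # gradient of the log loss w.r.t. the pre-activation
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(weights, bias, x):
    """Return True when the head scores the pair as a correct match."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Made-up 2-d "pretrained features" for four pairs; labels mimic the trans field.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [1, 1, 0, 0]
weights, bias = train_head(features, labels)
```

In practice the features would come from the multilingual model, and a library-provided classification head would normally replace this hand-written loop.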

Your problem is specifically a sentence pair classification problem, and there is a tutorial for that here. Pay close attention to the data processing phase of the tutorial. Overall, the changes you need to make to what the tutorial describes are:

You need to use multilingual BERT
You need to set up your data exactly as the tutorial describes, but with your own French/English snippets in place of its sentence pairs
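
As a rough sketch of those two changes, assuming the Hugging Face `transformers` library and PyTorch are installed (the tutorial referenced above may use a different toolkit): `prepare_pairs` and `build_model_inputs` are hypothetical helper names, and `bert-base-multilingual-cased` is one publicly available multilingual BERT checkpoint.

```python
def prepare_pairs(records):
    """Split records into parallel lists: French names, English descriptions, labels."""
    fr_texts = [r["fr"] for r in records]
    eng_texts = [r["eng"] for r in records]
    labels = [int(r["trans"]) for r in records]
    return fr_texts, eng_texts, labels


def build_model_inputs(records, model_name="bert-base-multilingual-cased"):
    """Tokenize (fr, eng) pairs and load multilingual BERT with a 2-class head."""
    # Lazy imports: torch and transformers are heavy, optional dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    fr_texts, eng_texts, labels = prepare_pairs(records)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Passing two lists encodes each pair as [CLS] fr [SEP] eng [SEP].
    batch = tokenizer(
        fr_texts, eng_texts, padding=True, truncation=True, return_tensors="pt"
    )
    batch["labels"] = torch.tensor(labels)
    # num_labels=2 puts a newly initialised classification layer on top of BERT.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )
    return model, batch
```

Calling `model(**batch)` on the returned objects gives an output with a `loss` and `logits`, which a standard training loop can then optimise on the training set before predicting the trans field for the test set.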

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange