Predicting correct match of French to English food descriptions
10-12-2020
Question
I have a training set and a test set of food description pairs (see the example below). The first item in each pair is the name of a food in French, and the second is a description of that food in English. The training set also has a trans field, which is True for correct descriptions and False for wrong ones. The task is to predict the trans field for the test set, in other words to predict which food descriptions are correct and which are wrong.
import pandas as pd

# Labelled training pairs: French name, English description, and whether they match
dishes = [
    {"fr": "Agneau de lait", "eng": "Baby milk-fed lamb", "trans": True},
    {"fr": "Agrume", "eng": "Blackcurrants", "trans": False},
    {"fr": "Algue", "eng": "Buttermilk", "trans": False},
    {"fr": "Aligot", "eng": "potatoes mashed with fresh mountain cheese", "trans": False},
    {"fr": "Baba au rhum", "eng": "Star anise", "trans": True},
    {"fr": "Babeurre", "eng": "seaweed", "trans": False},
    {"fr": "Badiane", "eng": "Sponge cake (often soaked in rum)", "trans": False},
    {"fr": "Boeuf bourguignon", "eng": "Créole curry", "trans": False},
    {"fr": "Carbonade flamande", "eng": "Beef Stew", "trans": True},
    {"fr": "Cari", "eng": "Beef stewed in red wine", "trans": False},
    {"fr": "Cassis", "eng": "citrus", "trans": False},
    {"fr": "Cassoulet", "eng": "Stew from the South-West of France", "trans": True},
    {"fr": "Céleri-rave", "eng": "Celery root", "trans": True},
]
df = pd.DataFrame(dishes)
print(df)
    fr                  eng                                         trans
0   Agneau de lait      Baby milk-fed lamb                          True
1   Agrume              Blackcurrants                               False
2   Algue               Buttermilk                                  False
3   Aligot              potatoes mashed with fresh mountain cheese  False
4   Baba au rhum        Star anise                                  True
5   Babeurre            seaweed                                     False
6   Badiane             Sponge cake (often soaked in rum)           False
7   Boeuf bourguignon   Créole curry                                False
8   Carbonade flamande  Beef Stew                                   True
9   Cari                Beef stewed in red wine                     False
10  Cassis              citrus                                      False
11  Cassoulet           Stew from the South-West of France          True
12  Céleri-rave         Celery root                                 True
I am thinking of treating this as a text classification problem, where the input is a concatenation of the French name embedding and the English description embedding.
Questions:
- Which embeddings should I use, and how should I concatenate them?
- Any other ideas on how to approach this problem? BERT?
Update:
How about the following approach:
- Translate (with BERT?) the French names to English
- Use embeddings to create two vectors: v1 (the translated English name) and v2 (the English description from the data set)
- Compute v1 - v2
- Create a new data set with two columns: v1 - v2 and trans
- Train a classifier on this new data set
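The steps in this update can be sketched roughly as follows. This is only a minimal illustration, not a tested pipeline: `build_difference_features`, `toy_translate`, and `toy_embed` are hypothetical names, and the two toy functions are stand-ins (a tiny lookup table and a character-count hash) that would have to be replaced by a real translation model and a real embedding model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def build_difference_features(
    rows: Sequence[Dict],
    translate: Callable[[str], str],
    embed: Callable[[str], Vector],
) -> Tuple[List[Vector], List[bool]]:
    """Return (features, labels); each feature is embed(translate(fr)) - embed(eng)."""
    features: List[Vector] = []
    labels: List[bool] = []
    for row in rows:
        v1 = embed(translate(row["fr"]))  # translated French name
        v2 = embed(row["eng"])            # English description from the data set
        if len(v1) != len(v2):
            raise ValueError("embeddings must have the same dimension")
        features.append([a - b for a, b in zip(v1, v2)])
        labels.append(bool(row["trans"]))
    return features, labels


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_translate(text: str) -> str:
    lookup = {"Céleri-rave": "Celery root", "Cassis": "Blackcurrants"}
    return lookup.get(text, text)


def toy_embed(text: str, dim: int = 8) -> Vector:
    # Character counts hashed into `dim` buckets; NOT a semantic embedding.
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


rows = [
    {"fr": "Céleri-rave", "eng": "Celery root", "trans": True},
    {"fr": "Cassis", "eng": "citrus", "trans": False},
]
X, y = build_difference_features(rows, toy_translate, toy_embed)
```

`X` and `y` can then be passed to any standard classifier. One thing to keep in mind is that the plain difference discards part of the information in the two vectors, so concatenating v1, v2, and their difference is another option worth trying.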
Update 2:
It looks like cross-lingual classification may be the right solution for my problem:
https://github.com/facebookresearch/XLM#iv-applications-cross-lingual-text-classification-xnli
It is not yet clear to me from the description on the linked page where to plug in my own training set or how to run the classifier on my test set. Please help me figure this out. Ideally, I would like to find an end-to-end example or tutorial on cross-lingual classification.
Solution
As you suspected, the best approach would be to take a massive multilingual pretrained language model and make use of the information about French and English that it has already learned. You can read about some good options here.
The basic idea is to train a new, lightweight network to make predictions based on the output of the pretrained model; it's usual to use just a single-layer feed-forward network for this “fine-tuning”. Some implementations already have this conveniently coded up for you, so check the documentation for whatever you decide to use!
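To make the idea of a lightweight head concrete, here is a minimal sketch of a single linear layer with a sigmoid, trained on fixed feature vectors. Everything in it is an assumption for illustration: `LinearHead` is a hypothetical name, the 2-D features are made up and merely stand in for the pretrained model's output for each pair, and the pretrained model itself is treated as frozen here, whereas many implementations also update its weights during fine-tuning.

```python
import math
import random
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LinearHead:
    """One linear layer plus a sigmoid, trained on fixed feature vectors."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w: List[float] = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
        self.b = 0.0

    def predict_proba(self, x: Sequence[float]) -> float:
        logit = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return sigmoid(logit)

    def fit(self, X, y, lr: float = 0.5, epochs: int = 200) -> None:
        # Plain stochastic gradient descent on the log loss.
        for _ in range(epochs):
            for x, target in zip(X, y):
                err = self.predict_proba(x) - float(target)
                self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * err


# Made-up 2-D features standing in for the pretrained model's pair representations.
features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = [True, True, False, False]

head = LinearHead(dim=2)
head.fit(features, labels)
predictions = [head.predict_proba(x) > 0.5 for x in features]
```

In practice the feature vectors would have the pretrained model's hidden size rather than two dimensions, and a library implementation of the classification head would normally be used instead of hand-written gradient descent.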
Your problem is specifically a sentence pair classification problem, and there is a tutorial for that here. Pay close attention to the data processing phase of the tutorial. Overall, you need to change two things relative to what the tutorial describes:
- Use multilingual BERT as the pretrained model
- Prepare your data exactly as the tutorial describes, but use your own French name and English description pairs in place of its sentence pairs
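As a rough illustration of the data preparation step, the sketch below turns rows like the ones in the question into (first text, second text, label) triples and writes them as tab-separated text. The column names `text_a`, `text_b`, and `label`, and the helper names `to_sentence_pairs` and `write_tsv`, are assumptions; the tutorial you follow may expect a different layout.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str, int]


def to_sentence_pairs(rows: Iterable[Dict]) -> List[Pair]:
    """Map each row to (text_a, text_b, label), with label 1 for a correct description."""
    return [
        (row["fr"].strip(), row["eng"].strip(), int(bool(row["trans"])))
        for row in rows
    ]


def write_tsv(pairs: Iterable[Pair], handle) -> None:
    """Write a header line followed by one tab-separated line per pair."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["text_a", "text_b", "label"])
    writer.writerows(pairs)


dishes = [
    {"fr": "Agneau de lait", "eng": "Baby milk-fed lamb", "trans": True},
    {"fr": "Agrume", "eng": "Blackcurrants", "trans": False},
    {"fr": "Céleri-rave", "eng": "Celery root", "trans": True},
]

pairs = to_sentence_pairs(dishes)
buffer = io.StringIO()  # swap for open("train.tsv", "w", encoding="utf-8", newline="") to write a file
write_tsv(pairs, buffer)
tsv_text = buffer.getvalue()
```

If you end up using the Hugging Face transformers library, its tokenizers accept the two texts of a pair as two arguments and insert the separator tokens for you, and `bert-base-multilingual-cased` is one multilingual BERT checkpoint you could start from. Test rows have no trans value, so they would need the label column dropped or filled with a placeholder.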