Question

So I have a small corpus of about 30k documents, of which about 50 are in languages other than English (Persian, Chinese, Arabic, German, Spanish, etc.). I will be using this corpus to train a machine learning model.

Now the question is: How should these non-English documents be treated?

  1. Should I exclude them from the final corpus and from training the model?
  2. Or should I have them translated manually (asking native speakers of each language to translate them) and include them in the final corpus?
  3. Or should I use Google Translate/DeepL to translate these non-English documents into English and then include them in the final corpus? (A rough sketch of how options 1 and 3 could be automated follows this list.)
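
For concreteness, here is a rough sketch of how I imagine options 1 and 3 could be automated. The `langdetect` and `deep_translator` packages (and the toy `docs` list) are just assumptions on my part, not tools I am committed to:

```python
# Sketch of options 1 and 3, assuming the langdetect and deep_translator
# packages (pip install langdetect deep-translator). The docs list below
# is a stand-in for my real corpus.
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException
from deep_translator import GoogleTranslator

DetectorFactory.seed = 0  # make langdetect deterministic across runs

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",  # German
]

english_docs, other_docs = [], []
for doc in docs:
    try:
        lang = detect(doc)  # ISO 639-1 code, e.g. "en", "de", "fa"
    except LangDetectException:  # raised for empty or number-only text
        lang = "unknown"
    (english_docs if lang == "en" else other_docs).append(doc)

# Option 1: drop the non-English documents and train on english_docs only.
# Option 3: machine-translate them instead (needs network access) and keep them.
translator = GoogleTranslator(source="auto", target="en")
translated = [translator.translate(doc) for doc in other_docs]
final_corpus = english_docs + translated
```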

No document in the corpus is longer than 500 words.

Any help or hints would be appreciated.

