I want to start studying the field of machine translation [closed]

https://datascience.stackexchange.com/questions/80993

13-12-2020

Question

I've studied Japanese language and literature and passed some linguistics courses, and now for my master's I want to study natural language processing, especially machine translation. I took some data science courses online, so I'm now a little familiar with data science, but I know literally nothing about machine translation. Long story short, I need to write a proposal in the machine translation field (a university requirement), but I don't know where to start reading about it. I tried to read some essays, but the level was too high for me and I didn't understand a single thing. I'd be so thankful if you could guide me through this journey. Thank you ^__^

Solution

As you certainly know, Machine Translation (MT) is a challenging and useful task within Natural Language Processing (NLP). It is a specialized research domain, but also a very active and very competitive one (in particular because of its commercial applications).

So there's a massive amount of research being done and a massive amount of resources available, but gaining real expertise in MT requires quite a lot of background knowledge. Let's be clear: a beginner level in data science is not sufficient to understand state-of-the-art MT. Typically one needs not only a good level in statistics and programming, but also knowledge of the recent progress in MT: the older statistical MT approach (e.g. Moses) has been replaced by better neural MT approaches.

A slightly less ambitious objective would be to study the limitations of current MT systems, since this doesn't require understanding how they work. Note that even simply training a state-of-the-art model using existing software is not trivial and requires quite a lot of computational resources. I'd suggest looking at the resources and papers published at the Workshop on Machine Translation (WMT), including those from previous years. Note also that there are many sub-tasks related to MT to look at:

- model design
- evaluation metrics
- building training corpora
- quality estimation
- post-editing

The WMT Shared Tasks offer datasets for these different sub-tasks. Reading the overview paper for a task is a good way to get an idea of what it is and how it's done.
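To make one of the sub-tasks listed above a bit more concrete, here is a minimal sketch of what an automatic evaluation metric can look like. It computes a clipped n-gram precision between a system translation and a single reference translation, which is the basic ingredient of overlap-based metrics such as BLEU. This is an illustration only, not the metric used in any particular shared task: the function names and the example sentences are made up for this sketch, and real metrics add more on top (several n-gram orders combined, a length penalty, multiple references, proper tokenization).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference ("clipping"), so repeating a correct word does not help.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        # Hypothesis is shorter than n tokens: no n-grams to score.
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


# Hypothetical example: one human reference and one system output,
# tokenized naively on whitespace.
reference = "the cat is on the mat".split()
hypothesis = "the cat sat on the mat".split()

unigram_p = clipped_precision(hypothesis, reference, 1)
bigram_p = clipped_precision(hypothesis, reference, 2)
print(f"unigram precision: {unigram_p:.3f}")
print(f"bigram precision:  {bigram_p:.3f}")
```

Here five of the six hypothesis words and three of the five hypothesis bigrams appear in the reference, so the bigram score is noticeably lower than the unigram one. Word-overlap scores like this are cheap to compute but only a rough proxy for translation quality, which is one reason evaluation is a research topic in its own right.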

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange