Autocorrect a document corpus

https://stackoverflow.com/questions/22492155

autocomplete
nlp
machine-learning
text-analysis
nltk

17-06-2023

Question

I have a document corpus of roughly 6 GB, mostly user-generated content from mobile platforms. Because of where it comes from, the corpus is rife with misspelled, abbreviated, and truncated words. Is there a way I could autocorrect these words to the nearest English word?

Solution

This might be fun to look at, given that you tagged your question with machine-learning:

http://norvig.com/spell-correct.html
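To give a flavour of the approach described there, here is a minimal sketch of a frequency-based corrector in that style: generate every string within one or two simple edits of a word and pick the candidate seen most often in some reference text. The helper names (`words`, `edits1`, `make_corrector`) and the tiny `TOY_TEXT` are illustrative assumptions, not code copied from the page; for a real corpus you would count words from a large, clean English text instead.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def make_corrector(counts):
    """Build a correct(word) function from a word -> frequency mapping."""
    def known(candidates):
        return {w for w in candidates if w in counts}

    def correct(word):
        # Prefer the word itself, then known words one edit away,
        # then two edits away; fall back to leaving the word unchanged.
        candidates = (known([word])
                      or known(edits1(word))
                      or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                      or [word])
        return max(candidates, key=lambda w: counts.get(w, 0))

    return correct

# Assumption: a toy stand-in for a large reference text.
TOY_TEXT = "the spelling of the word is correct and the dog is lazy"
correct = make_corrector(Counter(words(TOY_TEXT)))
print(correct("speling"), correct("teh"))
```

Because words with no known neighbour within two edits are returned unchanged, abbreviations and slang that are far from any dictionary word will pass through untouched. The essay walks through the reasoning behind each of these steps in much more detail.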

It's a fascinating read. On the other hand, if you are not looking to tinker, a ready-made spell checker such as Enchant might be the better choice; have a look at its Python bindings:

https://pypi.org/project/pyenchant/
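As a rough sketch of how PyEnchant could be applied to text, the snippet below replaces every alphabetic token that fails the dictionary check with the first suggestion. `enchant.Dict`, `check`, and `suggest` are PyEnchant calls; the helpers `correct_text` and `enchant_checker`, the `en_US` tag, and the "take the first suggestion" policy are assumptions made for illustration. The checker is passed in as plain callables so the replacement logic does not depend on how the dictionary is built.

```python
import re

def correct_text(text, check, suggest):
    """Replace each alphabetic token that fails check() with the first suggestion.

    check(word) -> bool and suggest(word) -> list of str are supplied by the caller.
    Punctuation, digits, and whitespace are left as they are.
    """
    def fix(match):
        word = match.group(0)
        if check(word):
            return word
        suggestions = suggest(word)
        # Keep the original token when the dictionary has nothing to offer.
        return suggestions[0] if suggestions else word

    return re.sub(r"[A-Za-z]+", fix, text)

def enchant_checker(tag="en_US"):
    """Return (check, suggest) backed by a PyEnchant dictionary.

    Assumes pyenchant and the underlying Enchant library are installed
    and that a dictionary for the given language tag is available.
    """
    import enchant  # third-party: pip install pyenchant
    dictionary = enchant.Dict(tag)
    return dictionary.check, dictionary.suggest
```

With PyEnchant installed, `check, suggest = enchant_checker()` followed by `correct_text(line, check, suggest)` for each line would stream through the corpus one line at a time, which matters at 6 GB. Blindly taking the first suggestion will mangle names and slang, so it is worth spot-checking the output on a sample first.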

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow