How to handle misspelled words in documents for text mining tasks?

https://stackoverflow.com/questions/4276500

28-09-2019

Question

I have a set of informal documents (a couple of thousand) that I want to apply topic modeling (MALLET) to. The problem is that the documents contain a considerable number of misspelled words. Most are intentional, such as short forms and local lingo like 'juz' -> 'just' and 'alr' -> 'already'. Several such variations exist because different authors have their own peculiar writing styles.

After feeding them to MALLET, I'm a bit bothered that one of the generated topics is actually a set of misspelled stopwords. I believe these words are mostly used in a small subset of documents from the same author, hence MALLET picked them up.

My question is: do I spell-check and correct these misspelled words, and perhaps save the corrected text somewhere, before conducting further tasks on them? I suppose this would mean that I need to manually verify the corrections before committing, right? What would be the most "efficient" way to do this?

Or do I actually ignore these misspelled words?

Solution

What do you do with stopwords at the moment? If you are doing topic modelling then it would make sense to filter them out. If so, why don't you filter out these terms too?
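As a rough illustration of that suggestion, here is a minimal Python sketch that treats known informal variants as aliases of their canonical forms and then drops anything that ends up in the stoplist. The `VARIANTS` map (seeded with the two examples from the question), the tiny `STOPWORDS` set, and both function names are assumptions made for the example, not part of MALLET.

```python
import re

# Hypothetical map from informal variants to canonical forms. The two entries
# come from the question; a real list would be built from the corpus.
VARIANTS = {"juz": "just", "alr": "already"}

# Tiny illustrative stoplist; in practice use the full list your pipeline uses.
STOPWORDS = {"i", "the", "a", "to", "just", "already", "is", "it"}

TOKEN_RE = re.compile(r"[a-z']+")


def clean_tokens(text):
    """Lowercase, tokenize, map known variants to canonical forms, drop stopwords."""
    tokens = TOKEN_RE.findall(text.lower())
    normalized = [VARIANTS.get(tok, tok) for tok in tokens]
    return [tok for tok in normalized if tok not in STOPWORDS]


def extra_stopwords(variants, stopwords):
    """Return the variants whose canonical form is a stopword, sorted."""
    return sorted(v for v, canon in variants.items() if canon in stopwords)


if __name__ == "__main__":
    print(clean_tokens("I juz finished the report, alr sent it"))
    # ['finished', 'report', 'sent']
    print(extra_stopwords(VARIANTS, STOPWORDS))
    # ['alr', 'juz']
```

If you would rather leave the text untouched, the list returned by `extra_stopwords` could be written to a file, one word per line, and passed to MALLET's import step via its `--extra-stopwords` option together with `--remove-stopwords`.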

[Edit in response to reply]

There is some research about handling stopwords within LDA in a more principled way. There are two papers that spring to mind:

[1] uses a term-weighting scheme that apparently helps in a predictive task they set up; [2] uses a non-symmetric prior over the word distributions, which apparently leads to a few topics that contain all the stop words and other words common to the entire corpus.

It seems to me that the best way to automatically infer stop words and other non-topic words in LDA is still a research question.

Other tips

I don't think we can answer that without knowing the impact of misspelled words, or of miscorrected ones, on the outcome of your topic modelling. So if you could give more information, that would be good.

However, I would have thought you wanted to correct them, at least where the correction is clearly the intent of the original author.
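One way to keep the manual checking manageable is to have a script propose corrections and only apply the ones a person has approved. The sketch below is an assumption-laden illustration, not a recommended pipeline: it uses the standard library's `difflib.get_close_matches` as a stand-in for a real spell checker, and `suggest_corrections`, `apply_corrections`, the `vocabulary` list, `min_count` and `cutoff` are all hypothetical names and parameters chosen for the example.

```python
from collections import Counter
import difflib
import re

TOKEN_RE = re.compile(r"[a-z']+")


def suggest_corrections(documents, vocabulary, min_count=2, cutoff=0.5):
    """Propose candidate corrections for recurring out-of-vocabulary tokens.

    Returns {token: [candidates]} for a human to review; nothing is changed.
    """
    known = set(vocabulary)
    candidates = sorted(known)  # fixed order keeps the suggestions reproducible
    counts = Counter(
        tok for doc in documents for tok in TOKEN_RE.findall(doc.lower())
    )
    suggestions = {}
    for token, count in counts.items():
        # Skip known words and one-off tokens (likely names or noise).
        if token in known or count < min_count:
            continue
        matches = difflib.get_close_matches(token, candidates, n=3, cutoff=cutoff)
        if matches:
            suggestions[token] = matches
    return suggestions


def apply_corrections(text, approved):
    """Replace only tokens listed in the reviewed `approved` map.

    Note: the text is lowercased as a side effect.
    """
    return TOKEN_RE.sub(
        lambda m: approved.get(m.group(0), m.group(0)), text.lower()
    )


if __name__ == "__main__":
    docs = ["juz finished the meeting", "alr finished, juz now", "alr late"]
    vocab = ["just", "already", "meeting", "finished"]
    print(suggest_corrections(docs, vocab))
    print(apply_corrections(docs[1], {"juz": "just", "alr": "already"}))
```

The approved map can be saved alongside the corrected text, so the corrections stay reproducible and the original documents remain untouched.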

License: CC-BY-SA with attribution

Not affiliated with StackOverflow