Question

I have a large (roughly 300,000 lines) dataset of text fragments that contains some noisy elements. By noisy I mean slang words, typos, etc. I wish to filter out these noisy elements to get a cleaner dataset.

I read some papers that propose filtering these out by keeping track of the occurrence count of each word. By setting a threshold (e.g. fewer than 20 occurrences) we can assume these words are noise and can safely be removed from the corpus.

Maybe there are some libraries or algorithms that do this in a fast and efficient way. Of course I tried it myself first, but it was EXTREMELY slow!

So, to summarize, I am looking for an algorithm that can filter out words that occur less often than a particular threshold, in a fast and efficient way. Let me add a small example:

This is just an example of whaat I wish to acccomplish.

The words 'whaat' and 'acccomplish' are misspelled and thus likely to occur less often (if we assume we live in a perfect world and typos are rare…). I wish to end up with

This is just an example of I wish to.

Thanks!

PS: If possible, I'd like to have an algorithm in Java (or pseudo-code so I can write it myself)


Solution

I think you are overcomplicating it with the approach suggested in the comments.

You can do it with two passes over the data (see the sketch after the list):

  1. Build a histogram: a Map<String, Integer> that counts the number of occurrences of each word.
  2. For each word, write it to the new 'clean' file if and only if map.get(word) > THRESHOLD.
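A minimal sketch of those two passes in Java might look like the following. The file names, the whitespace tokenizer, and the THRESHOLD value of 20 are assumptions taken from the question; adapt them to your corpus.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class RareWordFilter {

        static final int THRESHOLD = 20; // assumed cut-off, tune to your data

        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Paths.get("corpus.txt")); // hypothetical input file

            // Pass 1: build the histogram of word counts
            Map<String, Integer> counts = new HashMap<>();
            for (String line : lines) {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }

            // Pass 2: keep only words whose count exceeds the threshold
            List<String> cleaned = lines.stream()
                    .map(line -> Arrays.stream(line.split("\\s+"))
                            .filter(w -> counts.getOrDefault(w, 0) > THRESHOLD)
                            .collect(Collectors.joining(" ")))
                    .collect(Collectors.toList());

            Files.write(Paths.get("corpus.clean.txt"), cleaned); // hypothetical output file
        }
    }

Both passes are linear in the size of the corpus, so this should be fast enough for a few hundred thousand lines; the HashMap lookup per word is the only per-token cost.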

As a side note, I think a fixed threshold is not the best choice. I would personally filter out words that occur fewer than MEAN - 3*STD times, where MEAN is the mean word count and STD is its standard deviation. (Three standard deviations means you are catching words that fall outside the expected normal distribution with a probability of ~99%.) You can 'play' with the constant factor and find what best suits your needs.
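For illustration, that cut-off could be computed from the histogram as in the snippet below. It assumes the counts map from the previous sketch; note that on a heavily skewed word-frequency distribution MEAN - 3*STD can come out negative, so you would still want to clamp it or adjust the constant factor.

    // Sketch of a statistical cut-off instead of a fixed one
    // (assumes the 'counts' histogram from pass 1 is already built).
    double mean = counts.values().stream().mapToInt(Integer::intValue).average().orElse(0);
    double variance = counts.values().stream()
            .mapToDouble(c -> (c - mean) * (c - mean))
            .average().orElse(0);
    double std = Math.sqrt(variance);
    double cutoff = mean - 3 * std; // words occurring less often than this are dropped

    // In pass 2, replace the fixed comparison with:
    // counts.getOrDefault(w, 0) > cutoff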

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow