I think you are overcomplicating it with the approach suggested in the comments.
You can do it with 2 passes on the data:
- Build a histogram, Map<String,Integer>, that counts the number of occurrences of each word.
- For each word, print it to the new 'clean' file if and only if map.get(word) > THRESHOLD.
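Here is a minimal sketch of the two passes, assuming the words have already been read and tokenized into a list. The names RareWordFilter, buildHistogram and filterRare are made up for illustration, and the result is returned as a list rather than written to a file to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RareWordFilter {
    // Pass 1: count how many times each word occurs.
    static Map<String, Integer> buildHistogram(List<String> words) {
        Map<String, Integer> histogram = new HashMap<>();
        for (String word : words) {
            histogram.merge(word, 1, Integer::sum);
        }
        return histogram;
    }

    // Pass 2: keep a word only if its count is strictly above the threshold.
    static List<String> filterRare(List<String> words, int threshold) {
        Map<String, Integer> histogram = buildHistogram(words);
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (histogram.get(word) > threshold) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "cat", "the", "dog", "the", "cat");
        // "dog" occurs once, so it is dropped with a threshold of 1.
        System.out.println(filterRare(words, 1)); // prints [the, cat, the, the, cat]
    }
}
```

For a file that does not fit in memory, the same idea should work by reading the input twice: once to fill the map and once to write the surviving words to the 'clean' file.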
As a side note: I think a fixed threshold is not the best choice. I would personally filter out words that occur fewer than MEAN-3*STD times, where MEAN is the average number of occurrences per word and STD is the standard deviation of those counts. (If the counts were normally distributed, about 99.7% of them would fall within 3 standard deviations of the mean, so a word below that cutoff is unusually rare.) You can 'play' with the constant factor and find what best suits your needs.
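The threshold itself could be computed along these lines. This is only a sketch: AdaptiveThreshold and compute are hypothetical names, k stands for the constant factor, and I am assuming the population standard deviation over the per-word counts.

```java
import java.util.HashMap;
import java.util.Map;

class AdaptiveThreshold {
    // Returns MEAN - k * STD over the per-word counts in the histogram.
    // Assumption: STD is the population standard deviation (divide by n).
    static double compute(Map<String, Integer> histogram, double k) {
        int n = histogram.size();
        if (n == 0) {
            return 0.0; // nothing counted, so nothing to filter
        }
        double sum = 0.0;
        for (int count : histogram.values()) {
            sum += count;
        }
        double mean = sum / n;

        double squaredDiffs = 0.0;
        for (int count : histogram.values()) {
            double diff = count - mean;
            squaredDiffs += diff * diff;
        }
        double std = Math.sqrt(squaredDiffs / n);
        return mean - k * std;
    }

    public static void main(String[] args) {
        Map<String, Integer> histogram = new HashMap<>();
        int[] counts = {2, 4, 4, 4, 5, 5, 7, 9}; // mean 5, standard deviation 2
        for (int i = 0; i < counts.length; i++) {
            histogram.put("word" + i, counts[i]);
        }
        System.out.println(compute(histogram, 1.0)); // prints 3.0
    }
}
```

Word counts tend to be heavily skewed, so with a factor of 3 the result may come out at or below zero, in which case nothing gets filtered. That is one reason to experiment with the constant.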