How to prune low-frequency and high-frequency words from a dataset?

https://stackoverflow.com/questions/21499768

05-10-2022

Question

Is there any tool available with which I can prune high-frequency and low-frequency terms from my dataset?

Solution

A commonly used algorithm for this is Grubbs' test, which checks whether the most extreme value in a sample is an outlier. I don't know of a Java implementation, but if you are willing to do the preprocessing in a different language, the outliers package in R includes Grubbs' test, among others. To eliminate multiple outliers, apply the test repeatedly, removing one outlier per pass until none is detected.
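To make the repeated application concrete, here is a minimal sketch in plain Python rather than R. The function names (`t_upper_quantile`, `grubbs_critical`, `prune_outliers`) are made up for illustration and are not taken from the outliers package. It assumes the values are term counts, uses a two-sided test at level `alpha`, and computes the t quantile numerically instead of relying on a statistics library. Keep in mind that Grubbs' test assumes roughly normal data, so treat the result as a rough guide.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution (log-gamma avoids overflow)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_quantile(p, df, steps=2000):
    """Return t with P(T > t) = p, using Simpson's rule and bisection."""
    def upper_tail(t):
        h = t / steps
        total = t_pdf(0.0, df) + t_pdf(t, df)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * t_pdf(i * h, df)
        return 0.5 - total * h / 3

    lo, hi = 0.0, 1.0
    while upper_tail(hi) > p:  # widen the bracket until it contains the quantile
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def grubbs_critical(n, alpha=0.05):
    """Two-sided critical value of the Grubbs statistic for sample size n."""
    t = t_upper_quantile(alpha / (2 * n), n - 2)
    return (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))


def prune_outliers(values, alpha=0.05):
    """Remove one outlier per pass until the test no longer rejects."""
    kept, removed = list(values), []
    while len(kept) >= 3:  # the test needs at least three observations
        m, s = mean(kept), stdev(kept)
        if s == 0:
            break
        suspect = max(kept, key=lambda v: abs(v - m))
        if abs(suspect - m) / s <= grubbs_critical(len(kept), alpha):
            break
        kept.remove(suspect)
        removed.append(suspect)
    return kept, removed


# Hypothetical term counts: one term is far more frequent than the rest.
kept, removed = prune_outliers([10, 11, 9, 10, 12, 10, 11, 200])
```

Each pass recomputes the mean, the standard deviation, and the critical value for the shrunken sample, which is what "repeatedly apply" amounts to.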

Edit:

I just saw that I missed the text classification tag. If you only want to keep overly frequent terms from skewing your results, TF-IDF may be of interest. It down-weights terms that appear in many documents but does not remove them, so it does not reduce dimensionality.
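As a rough illustration of the weighting, the sketch below computes TF-IDF by hand with relative term frequency and an unsmoothed `log(N / df)` inverse document frequency. The function name `tf_idf` and the toy documents are assumptions for the example, and libraries often use smoothed variants of the formula.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
weights = tf_idf(docs)
```

A term such as "the" that occurs in every document gets weight 0, yet it still appears as a feature in each document's dictionary, which is why the number of dimensions stays the same.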

Other tips

Removing stop words is a common technique for eliminating (very) high-frequency words in natural language processing.

Low-frequency words are usually interesting. Do you actually want to eliminate them?
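If you do decide to prune both ends, a minimal sketch could combine a stop-word list with document-frequency thresholds. The function name `prune_vocabulary`, its parameters, and the default thresholds are made up for illustration and would need tuning for a real dataset.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df_ratio=0.8, stop_words=frozenset()):
    """Drop stop words and terms that occur in too few or too many documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    max_df = max_df_ratio * len(docs)
    keep = {
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df and term not in stop_words
    }
    return [[term for term in doc if term in keep] for doc in docs]


docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "zebra", "slept"],
    ["a", "cat", "slept"],
]
pruned = prune_vocabulary(docs, min_df=2, max_df_ratio=0.7)
```

Setting `min_df=1` keeps the rare terms, which may be the better default given that low-frequency words often carry the most information.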

License: CC-BY-SA with attribution

Not affiliated with StackOverflow