How to prune low frequency and high frequency words from a dataset?

https://stackoverflow.com/questions/21499768

05-10-2022

Question

Is there a tool I can use to prune high-frequency and low-frequency terms from my dataset?

Solution

A commonly used outlier-detection algorithm for this is Grubbs' test. I don't know of an implementation in Java, but if you are willing to do the preprocessing in a different language, the outliers package in R contains Grubbs' test, among others. A single run of the test flags at most one outlier, so to eliminate several you can apply it repeatedly, removing the flagged value each time until nothing more is flagged.
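To illustrate the repeated application, here is a minimal Python sketch (not the R package's implementation). It computes the Grubbs statistic G = max|x_i - mean| / s over term frequencies and drops the most extreme term while G exceeds a threshold. The function names and the `critical_value` callable are assumptions for illustration: the real critical value depends on the sample size and significance level and would come from a table or a statistics library. Note also that Grubbs' test assumes roughly normally distributed data.

```python
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return (G, index) for the value farthest from the mean.

    G = max |x_i - mean| / s, with s the sample standard deviation.
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return 0.0, None  # all values identical, nothing to flag
    idx = max(range(len(values)), key=lambda i: abs(values[i] - m))
    return abs(values[idx] - m) / s, idx


def prune_outliers(freqs, critical_value):
    """Repeatedly drop the most extreme term while G > critical_value(n).

    freqs: dict mapping term -> frequency.
    critical_value: hypothetical callable n -> threshold (an assumption;
        supply values from a Grubbs table or a statistics library).
    """
    kept = dict(freqs)
    while len(kept) >= 3:  # the statistic is not meaningful for tiny samples
        terms = list(kept)
        g, idx = grubbs_statistic([kept[t] for t in terms])
        if idx is None or g <= critical_value(len(terms)):
            break
        del kept[terms[idx]]
    return kept
```

For example, `prune_outliers(term_counts, critical_value=my_table_lookup)` would return the counts with the flagged terms removed, where `my_table_lookup` is whatever lookup you provide.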

Edit:

I just saw that I missed the text classification tag. If you only want to keep overly frequent terms from skewing your results, TF-IDF weighting may be of interest: it down-weights terms that occur in many documents. It does not remove any terms, though, so it does not reduce dimensionality.
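As a rough sketch of the idea, the following Python function computes one common TF-IDF variant (relative term frequency times the logarithm of inverse document frequency). The function name and the input format (a list of token lists) are assumptions for illustration; libraries often use slightly different smoothing.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term that appears in every document gets weight 0 but stays in the vocabulary, which is why the number of dimensions is unchanged.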

Other tips

Removing stop words is a common technique for eliminating (very) high-frequency words in natural language processing.
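A minimal Python sketch of stop-word filtering follows. The tiny `STOP_WORDS` set and the function name are assumptions for illustration; real stop lists are much longer and language-specific.

```python
# Tiny illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]
```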

Low-frequency words are usually interesting. Do you actually want to eliminate them?
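If you do decide to prune at both ends, one simple approach is to threshold on document frequency. The sketch below is only an illustration: the function name, the input format (a list of token lists), and the default thresholds `min_df` and `max_df_ratio` are assumptions that would need tuning for a real dataset.

```python
from collections import Counter


def prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Drop terms that occur in too few or too many documents.

    docs: list of token lists.
    min_df: minimum number of documents a term must appear in.
    max_df_ratio: maximum fraction of documents a term may appear in.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    keep = {
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }
    return [[t for t in doc if t in keep] for doc in docs]
```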

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow