Question

I want to extract relevant keywords from an HTML page.

I have already stripped out all the HTML markup, split the text into words, run them through a stemmer, and removed every word that appears in Lucene's stop word list.

But the most common remaining words are still basic verbs and pronouns.

Is there some method, or a ready-made word list, in Lucene, Snowball, or anywhere else to filter out all these things like "I", "is", "go", "went", "am", "it", "were", "we", "you", "us", ...?

Solution

It seems like a pretty simple application of document frequency, the quantity that inverse document frequency is built from. If you had even a small corpus of, say, 10,000 web pages, you could compute the fraction of documents each word appears in. Words that show up in nearly every page carry little content, so pick a threshold where you think the words start to get interesting or contentful and exclude everything more frequent than that.
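As a rough illustration, here is a minimal Java sketch of that idea: count how many documents each word occurs in, then flag anything whose document frequency exceeds a cutoff. The 0.5 threshold in the example call is an arbitrary placeholder, not a recommended value.

```java
import java.util.*;

public class StopwordFinder {

    // Given tokenized documents (each reduced to its set of distinct words),
    // flag every word whose document frequency exceeds the threshold.
    public static Set<String> highFrequencyWords(List<Set<String>> docs, double threshold) {
        Map<String, Integer> docCount = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String word : doc) {
                docCount.merge(word, 1, Integer::sum);
            }
        }
        Set<String> stopwords = new HashSet<>();
        for (Map.Entry<String, Integer> e : docCount.entrySet()) {
            double docFreq = (double) e.getValue() / docs.size();
            if (docFreq > threshold) { // present in most documents -> probably not contentful
                stopwords.add(e.getKey());
            }
        }
        return stopwords;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("i", "went", "home"),
                Set.of("i", "am", "here"),
                Set.of("i", "like", "lucene"));
        System.out.println(highFrequencyWords(docs, 0.5)); // prints [i]
    }
}
```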

Alternatively, this list looks good: http://www.lextek.com/manuals/onix/stopwords1.html
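If you go the list route, filtering against it is straightforward. A minimal sketch, assuming you have saved that list locally as `stopwords1.txt` (a hypothetical file name), one word per line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ListFilter {
    public static void main(String[] args) throws IOException {
        // Load the downloaded stop word list, normalized to lowercase.
        Set<String> stopSet = Files.readAllLines(Paths.get("stopwords1.txt")).stream()
                .map(String::trim)
                .map(String::toLowerCase)
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toSet());

        // Drop every token that appears in the list.
        List<String> tokens = List.of("i", "went", "home", "it", "is", "late");
        List<String> kept = tokens.stream()
                .filter(t -> !stopSet.contains(t))
                .collect(Collectors.toList());
        System.out.println(kept);
    }
}
```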

OTHER TIPS

You are looking for the term "stop words". For Lucene this is built in, and you can supply your own additions via StopWordAnalyzer.java (see http://ankitjain.info/ankit/2009/05/27/lucene-search-ignore-word-list/).
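The linked post covers StopWordAnalyzer; as a hedged alternative sketch, recent Lucene versions let you pass a custom stop set directly to StandardAnalyzer. The exact package and constructor have moved between releases, so treat this as an outline rather than version-accurate code:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopwordDemo {
    public static void main(String[] args) throws IOException {
        // Extend this set with whatever function words still dominate your counts.
        CharArraySet stopWords = new CharArraySet(
                Arrays.asList("i", "is", "go", "went", "am", "it", "were", "we", "you", "us"),
                true); // ignoreCase
        try (StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);
             TokenStream ts = analyzer.tokenStream("body", "I went home, it is late")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term); // prints "home" and "late"
            }
            ts.end();
        }
    }
}
```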

The tm package for R provides an interface to many common NLP tasks and also has an interface to Weka. It might be worth checking out; the documentation is on CRAN.

Upon looking at your question more closely, you are probably looking for the removeStopWords() function in the tm package.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow