Search Engine Stopwords - Best Practices [closed]

https://stackoverflow.com/questions/13600467

03-12-2021

Question

It is common practice not to index so-called stop words when analyzing documents for a search engine. Stop words are common words, such as "a", "the", and "this", that appear frequently in a language. The idea is that if stop words are indexed, they take up too much space in the index and add little to the quality of the search results.

I would like to know if this is always the case.

In modern search engines, does indexing stop words make the index size explode? Or is it just a marginal increase?

Also, how does removing stop words affect phrase searches? Searching for "beatles" and "the beatles" seem to be two very different things.

I am building an app with Elasticsearch, but this question is equally applicable to Solr, direct Lucene, or any other variant.

Solution

The main problem with stop words is not the index size but the quality of the answer. They tend to dominate: they have very high tf (term frequency) values and can therefore skew the ranking of the returned results.
In any case, indexing stop words does not increase the size of the index significantly (and it definitely does not "explode").
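To illustrate the dominance effect described above, here is a minimal sketch that scores documents by raw term frequency only. The documents, the stop word list, and the function names are all made up for illustration. Real engines typically also weight terms by inverse document frequency, which reduces this effect, so treat it as a toy model rather than a description of how Lucene or Elasticsearch score.

```python
from collections import Counter

# Illustrative stop word list and documents (assumptions, not from any engine).
STOP_WORDS = {"a", "and", "of", "the", "this"}

DOCS = {
    "city": "the history of the city and the river and the bridge",
    "music": "beatles albums ranked",
}


def raw_tf_score(query, text, drop_stop_words=False):
    """Sum the raw term frequency of each query term in the text."""
    counts = Counter(text.lower().split())
    terms = query.lower().split()
    if drop_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    return sum(counts[t] for t in terms)


def rank(query, drop_stop_words=False):
    """Return document ids ordered from highest to lowest raw tf score."""
    scores = {
        doc_id: raw_tf_score(query, text, drop_stop_words)
        for doc_id, text in DOCS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# "the" occurs four times in the unrelated document, so it outranks
# the only document that actually mentions "beatles".
print(rank("the beatles"))                        # ['city', 'music']
print(rank("the beatles", drop_stop_words=True))  # ['music', 'city']
```

With the stop word counted, the unrelated document wins purely because "the" is frequent in it. Once the stop word is dropped from the query, the relevant document comes first.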
One way to overcome this dominance is to keep the stop words (rather than omitting them completely) when indexing n-grams. I don't know whether this is actually done in practice, but it can definitely help improve the returned results.
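The n-gram idea can be sketched as follows: single words are indexed without stop words, while word bigrams are built from the full token stream, so a stop word survives only as part of a phrase. This is a toy in-memory inverted index under my own assumptions (whitespace tokenization, a hand-picked stop word list, hypothetical function names), not the way any particular engine implements it.

```python
from collections import defaultdict

# Illustrative stop word list (an assumption, not taken from any engine).
STOP_WORDS = {"a", "an", "and", "of", "the", "this"}


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()


def index_terms(text):
    """Unigrams without stop words, plus bigrams that keep them."""
    tokens = tokenize(text)
    unigrams = [t for t in tokens if t not in STOP_WORDS]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return unigrams + bigrams


def build_index(docs):
    """Map each indexed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """One-word queries hit unigrams; longer queries intersect their bigrams.

    Intersecting bigrams only approximates phrase matching for queries of
    three or more words, since the bigrams may occur in separate places.
    """
    tokens = tokenize(query)
    if len(tokens) == 1:
        terms = tokens
    else:
        terms = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "The Beatles released Abbey Road",
    2: "A book about beatles fans",
    3: "The road to the studio",
}
index = build_index(docs)

print(search(index, "beatles"))      # {1, 2}
print(search(index, "the beatles"))  # {1}
```

In this sketch "the" never gets a posting list of its own, so it cannot dominate single-word scoring, yet "beatles" and "the beatles" still return different results.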

Also, stop words are not always omitted. In sarcasm detectors, for example, stop words seem (empirically) to be very significant to the answer.

OTHER TIPS

I think all search engines handle this differently. You can read about these things at: http://searchenginewatch.com

But if you are just one guy building a (small) app, I don't think you should focus on these minor details. Just leave out these words and focus on the more relevant ones instead.

Licensed under: CC-BY-SA with attribution
