Question

I'm using Crate for a German news site and use fulltext search extensively (which generally works well enough). However, I was wondering about stop word usage. I'd like to minimize it, since search is plenty fast and I'm not too worried about performance. Is this advisable? And which stop words are actually used by default? Is there a list of built-in stop words somewhere?

Solution

The built-in stop words actually come from Lucene and are inside the lucene-analyzers-common*.jar file in the lib directory of the Crate tarball.

If you extract the contents of the jar file, you'll find a file called german_stop.txt, which contains all German stop words.
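Since a jar is just a zip archive, you can also read the list without unpacking it by hand. The following is a minimal Python sketch, not anything shipped with Crate. The function names are made up for illustration, the file is looked up by its base name because its exact path inside the jar is not given here, and the parsing assumes the file uses the Snowball stop word format (`|` starts a comment, words are separated by whitespace).

```python
import sys
import zipfile
from pathlib import PurePosixPath


def parse_snowball_stopwords(text):
    """Parse a stop word list, assuming Snowball format:
    everything after '|' on a line is a comment, and the
    remaining words are separated by whitespace."""
    words = set()
    for line in text.splitlines():
        content = line.split("|", 1)[0]  # drop the trailing comment, if any
        words.update(content.split())    # blank lines contribute nothing
    return words


def read_stopwords_from_jar(jar_path, filename="german_stop.txt"):
    """Find `filename` anywhere inside the jar and return its stop words as a set."""
    with zipfile.ZipFile(jar_path) as jar:
        matches = [n for n in jar.namelist() if PurePosixPath(n).name == filename]
        if not matches:
            raise FileNotFoundError(f"{filename} not found in {jar_path}")
        raw = jar.read(matches[0])
    return parse_snowball_stopwords(raw.decode("utf-8"))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to the lucene-analyzers-common jar as the first argument.
    for word in sorted(read_stopwords_from_jar(sys.argv[1])):
        print(word)
```

Run it with the path to the jar from your own Crate lib directory as the only argument. It prints one stop word per line, so you can grep for the words you care about.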

There is also a set of words inside the Lucene source code that is marked as deprecated, so I assume it's no longer in use. These words are:

"einer", "eine", "eines", "einem", "einen",
"der", "die", "das", "dass", "daß",
"du", "er", "sie", "es",
"was", "wer", "wie", "wir",
"und", "oder", "ohne", "mit",
"am", "im", "in", "aus", "auf",
"ist", "sein", "war", "wird",
"ihr", "ihre", "ihres",
"als", "für", "von", "mit",
"dich", "dir", "mich", "mir",
"mein", "sein", "kein",
"durch", "wegen", "wird"

I think the default is good enough. Unless you run into trouble with specific words, I don't see a reason to tweak the stop words.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow