Question

In PyLucene, there is a filter called StopFilter which filters tokens based on given stopwords. The example call is as follows:

result = StopFilter(True, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET)

It seems like it should be easy to replace the argument for the set of stop words, but this is actually a bit challenging:

>>> StopAnalyzer.ENGLISH_STOP_WORDS_SET

<Set: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]>

This is a Set, which cannot be instantiated directly:

>>> Set()

NotImplementedError: ('instantiating java class', <type 'Set'>)

It was suggested elsewhere to use a PythonSet, which comes with PyLucene, but it turns out that this is not an instance of a Set, and cannot be used with a StopFilter.

How can one give a StopFilter a new set of stop words?


Solution

I found the answer halfway through writing this question, in this thread on the pylucene-dev list:

http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201202.mbox/thread

HashSet is a concrete implementation of Set, so you can wrap a Python list in one and pass it to StopFilter as the custom stop word set:

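# Assumes PyLucene 3.x, where wrapped Java classes are exposed in the flat `lucene` namespace
from lucene import Arrays, HashSet, StopFilter
mystops = HashSet(Arrays.asList(['a','b','c']))
result = StopFilter(True, result, mystops)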
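
To see what the custom set does to a token stream without starting a JVM, here is a rough pure-Python sketch of the same idea. It is only an illustration of stop word filtering, not PyLucene's implementation; the function name `remove_stopwords` and the sample tokens are made up for this example, and matching is assumed to be exact and case-sensitive.

```python
def remove_stopwords(tokens, stops):
    """Return the tokens that are not in the stop word collection.

    Hypothetical helper for illustration only; it mimics the effect of
    a stop filter on a list of already-tokenized strings.
    """
    stop_set = set(stops)  # set membership checks are O(1) on average
    return [tok for tok in tokens if tok not in stop_set]


# Same custom stop words as in the HashSet example above
mystops = ['a', 'b', 'c']
filtered = remove_stopwords(['a', 'quick', 'b', 'fox'], mystops)
print(filtered)
```

With the three custom stop words, only the tokens that are not in the set survive, which is the behavior to expect from the StopFilter built with `mystops`.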
Licensed under: CC-BY-SA with attribution