Python: Whoosh seems to return incorrect results

https://stackoverflow.com//questions/25087290

02-01-2020

Question

This code is straight from Whoosh's quickstart docs:

import os.path
from whoosh.index import create_in
from whoosh.fields import Schema, STORED, ID, KEYWORD, TEXT
from whoosh.index import open_dir
from whoosh.query import *
from whoosh.qparser import QueryParser

#establish schema to be used in the index
schema = Schema(title=TEXT(stored=True), content=TEXT,
                path=ID(stored=True), tags=KEYWORD, icon=STORED)

#create index directory
if not os.path.exists("index"):
    os.mkdir("index")

#create the index using the schema specified above
ix = create_in("index", schema)

#instantiate the writer object
writer = ix.writer()

#add the docs to the index
writer.add_document(title=u"My document", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=u"Second try", content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                    path=u"/c", tags=u"short", icon=u"/icons/book.png")

#commit those changes
writer.commit()

#identify searcher
with ix.searcher() as searcher:

    #specify parser
    parser = QueryParser("content", ix.schema)

    #specify query -- try also "second"
    myquery = parser.parse("is")

    #search for results
    results = searcher.search(myquery)

    #identify the number of matching documents
    print(len(results))

The only thing I changed is the value passed to the parser.parse() call: the verb "is". When I run this, I get zero results rather than the two I expected. If I replace "is" with "second", I get one result, as expected. Why doesn't the search for "is" yield a match?

Edit

As @Philippe points out, Whoosh's default text analyzer removes stop words, which explains the behavior described above. To retain stop words, specify the analyzer for a given field in the schema and pass it stoplist=None so that it does not strip them; e.g.:

from whoosh import analysis

schema = Schema(title=TEXT(stored=True, analyzer=analysis.StandardAnalyzer(stoplist=None)))

Solution

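The default text analyzer applies a stop word filter, and "is" is on its default stop list, so the word is dropped both from the indexed text and from the parsed query: https://bitbucket.org/mchaput/whoosh/src/999cd5fb0d110ca955fab8377d358e98ba426527/src/whoosh/analysis/filters.py?at=default#cl-41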
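
To see why a stop word filter leads to zero hits, here is a minimal pure-Python sketch of the mechanism. It is not Whoosh's implementation: the analyze, build_index and search helpers are hypothetical names made up for illustration, and STOP_WORDS is an abbreviated stand-in for the real default list. The point is only that when the same filter runs over both the documents and the query, a query made up entirely of stop words has no terms left to match.

```python
import re

# Hypothetical, abbreviated stop list for illustration only;
# Whoosh's real default list is longer.
STOP_WORDS = frozenset(["a", "an", "and", "is", "are", "the", "this", "to"])


def analyze(text, stoplist=STOP_WORDS):
    """Tokenize, lowercase, and optionally drop stop words."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    if stoplist is None:
        return tokens
    return [t for t in tokens if t not in stoplist]


def build_index(docs, stoplist=STOP_WORDS):
    """Map each surviving term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text, stoplist):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, stoplist=STOP_WORDS):
    """Return ids of documents containing every term left in the query."""
    terms = analyze(query, stoplist)
    if not terms:
        return set()  # the whole query was stop words, so nothing can match
    return set.intersection(*(index.get(t, set()) for t in terms))


docs = {
    "/a": "This is my document!",
    "/b": "This is the second example.",
    "/c": "Examples are many.",
}

filtered = build_index(docs)                   # stop words removed
unfiltered = build_index(docs, stoplist=None)  # stop words kept

print(len(search(filtered, "is")))                    # 0
print(len(search(filtered, "second")))                # 1
print(len(search(unfiltered, "is", stoplist=None)))   # 2
```

In this toy version, passing stoplist=None plays the same role as StandardAnalyzer(stoplist=None) in the edit to the question: the word "is" stays in the index and in the query, so both documents containing it are found.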

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow