Whoosh NestedChildren search not returning all results

https://stackoverflow.com/questions/23482631

16-07-2023

Question

I'm making a search index which must support nested hierarchies of data. For test purposes, I'm making a very simple schema:

from whoosh.fields import ID, NGRAMWORDS, TEXT, Schema

test_schema = Schema(
    name_ngrams=NGRAMWORDS(minsize=4, field_boost=1.2),
    name=TEXT(stored=True),
    id=ID(unique=True, stored=True),
    type=TEXT
)

For test data I'm using the following:

test_data = [
    dict(
        name=u'The Dark Knight Returns',
        id=u'chapter_1',
        type=u'chapter'),
    dict(
        name=u'The Dark Knight Triumphant',
        id=u'chapter_2',
        type=u'chapter'),
    dict(
        name=u'Hunt The Dark Knight',
        id=u'chapter_3',
        type=u'chapter'),
    dict(
        name=u'The Dark Knight Falls',
        id=u'chapter_4',
        type=u'chapter')
]

parent = dict(
    name=u'The Dark Knight Returns',
    id=u'book_1',
    type=u'book')

I've added all five documents to the index, like this:

with index_writer.group():
    index_writer.add_document(
        name_ngrams=parent['name'],
        name=parent['name'],
        id=parent['id'],
        type=parent['type']
    )
    for data in test_data:
        index_writer.add_document(
            name_ngrams=data['name'],
            name=data['name'],
            id=data['id'],
            type=data['type']
        )

So, to get all the chapters for a book, I've made a function which uses a NestedChildren search:

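import os

from whoosh import qparser
from whoosh.index import open_dir
from whoosh.qparser.dateparse import DateParserPlugin
from whoosh.query import And, NestedChildren, Term

# `settings` is the asker's own project configuration (not shown here).


def search_childs(query_string):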
    os.chdir(settings.SEARCH_INDEX_PATH)
    # Initialize index
    index = open_dir(settings.SEARCH_INDEX_NAME, indexname='test')
    parser = qparser.MultifieldParser(
        ['name',
         'type'],
        schema=index.schema)
    parser.add_plugin(qparser.FuzzyTermPlugin())
    parser.add_plugin(DateParserPlugin())

    myquery = parser.parse(query_string)

    # First, we need a query that matches all the documents in the "parent"
    # level we want of the hierarchy
    all_parents = And([parser.parse(query_string), Term('type', 'book')])

    # Then, we need a query that matches the children we want to find
    wanted_kids = And([parser.parse(query_string),
                       Term('type', 'chapter')])
    q = NestedChildren(all_parents, wanted_kids)
    print(q)

    with index.searcher() as searcher:
        #these results are the parents
        results = searcher.search(q)
        print("number of results: %d" % len(results))
        if len(results):
            for result in results:
                print(result.highlights('name'))
                print(result)
            return results

But with my test data, if I search for "dark knight", I only get 3 results when there should be 4.

I don't know whether the missing result is excluded because it has the same name as the book, but it simply doesn't show up in the search results.

I know that all the items are in the index, but I don't know what I'm missing here.

Any thoughts?

Solution

Turns out that I was using NestedChildren wrong. Here is the answer I got from Matt Chaput on Google Groups:

> I'm making a search index which must support nested hierarchies of data.

The second parameter to NestedChildren isn't what you think it is.

TL;DR: you're using the wrong query type. Let me know what you're trying to do, and I can tell you how to do it :)

ABOUT NESTED CHILDREN

(Note, I found a bug, see the end)

NestedChildren is hard to understand, but hopefully I can try to explain it better.

NestedChildren is about searching for certain PARENTS, but getting their CHILDREN as the hits.

The first argument is a query that matches all documents of the "parent" class (e.g. "type:book"). The second argument is a query that matches all documents of the parent class that match your search criteria (e.g. "type:book AND name:dark").

In your example, this would mean searching for a certain book, but getting its chapters as the search results.
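To make the parent/child relationship concrete, here is a minimal plain-Python model of the behaviour described above. It is not Whoosh code and says nothing about Whoosh's internals: the `nested_children` function and the list standing in for the index are hypothetical, and the sketch assumes that each parent document is stored immediately before its children, as happens when they are added inside one `group()`.

```python
# Plain-Python model of the NestedChildren semantics described above.
# NOT Whoosh code: nested_children() and the list "index" are illustrative only.
# Assumption: documents sit in index order, each parent directly followed
# by its children (the layout produced by adding them inside one group).

docs = [
    {"id": "book_1", "type": "book", "name": "The Dark Knight Returns"},
    {"id": "chapter_1", "type": "chapter", "name": "The Dark Knight Returns"},
    {"id": "chapter_2", "type": "chapter", "name": "The Dark Knight Triumphant"},
    {"id": "chapter_3", "type": "chapter", "name": "Hunt The Dark Knight"},
    {"id": "chapter_4", "type": "chapter", "name": "The Dark Knight Falls"},
]


def nested_children(docs, is_parent, is_wanted_parent):
    """Return the children of every parent that satisfies is_wanted_parent.

    is_parent plays the role of the first argument (all parents);
    is_wanted_parent plays the role of the second (the parents searched for).
    """
    hits = []
    collecting = False
    for doc in docs:
        if is_parent(doc):
            # A new group starts here; collect its children only if wanted.
            collecting = is_wanted_parent(doc)
        elif collecting:
            hits.append(doc)
    return hits


def is_book(doc):
    return doc["type"] == "book"


def is_dark_book(doc):
    return is_book(doc) and "dark" in doc["name"].lower()


chapters = nested_children(docs, is_book, is_dark_book)
chapter_ids = [doc["id"] for doc in chapters]
print(chapter_ids)
```

In this model the search criterion is applied to the book only, and every chapter of a matching book comes back as a hit, including the chapter that shares the book's name.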

This isn't super useful on its own, but you can combine it with queries on the children to do complex queries like "show me chapters with 'hunt' in their names that are in books with 'dark' in their names":

from whoosh import query

# Find the children of books matching the book criterion
all_parents = query.Term("type", "book")
wanted_parents = query.Term("name", "dark")
children_of_wanted_parents = query.NestedChildren(all_parents, wanted_parents)

# Find the children matching the chapter criterion
wanted_chapters = query.And([query.Term("type", "chapter"),
                             query.Term("name", "hunt")])

# The intersection of those two queries is the set of chapters we want
complex_query = query.And([children_of_wanted_parents,
                           wanted_chapters])

OR, at least, that's how it SHOULD work. But I just found a bug in the implementation of NestedChildren's skip_to() method that makes the above example not work :( :( :( The bug is now fixed on Bitbucket, I'll have to make a new release.
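The intersection step can be pictured with sets of document numbers. The sketch below is plain Python rather than Whoosh, and it only models the intended result, not the library's matcher. The second book, the helpers `children_of` and `matching`, and the whole-word test standing in for term matching are all assumptions made for illustration.

```python
# Plain-Python model of "chapters with 'hunt' in their names that are in
# books with 'dark' in their names". NOT Whoosh code; the data and both
# helper functions are hypothetical.
# Each tuple is (type, name); parents directly precede their children.
docs = [
    ("book", "The Dark Knight Returns"),
    ("chapter", "Hunt The Dark Knight"),
    ("chapter", "The Dark Knight Falls"),
    ("book", "Year One"),
    ("chapter", "The Hunt Begins"),
]


def children_of(docs, wanted_parent_numbers):
    """Document numbers of the children of the given parent documents."""
    result = set()
    current_parent = None
    for number, (doc_type, _name) in enumerate(docs):
        if doc_type == "book":
            current_parent = number
        elif current_parent in wanted_parent_numbers:
            result.add(number)
    return result


def matching(docs, doc_type, word):
    """Document numbers of the given type whose name contains the word."""
    return {
        number
        for number, (kind, name) in enumerate(docs)
        if kind == doc_type and word in name.lower().split()
    }


wanted_parents = matching(docs, "book", "dark")
children_of_wanted_parents = children_of(docs, wanted_parents)
wanted_chapters = matching(docs, "chapter", "hunt")

# And([...]) corresponds to set intersection in this model.
complex_result = children_of_wanted_parents & wanted_chapters
print(sorted(complex_result))
```

Here "The Hunt Begins" matches the chapter criterion but is dropped because its book does not match, which is the filtering the combined query is meant to perform.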

Cheers,

Matt

Licensed under: CC-BY-SA with attribution
