Zend Lucene - Wildcard search based off of Fuzzy search

https://stackoverflow.com/questions/23185648

06-07-2023

Question

Goal: run a fuzzy search, then run a wildcard search using the similar terms it finds

I have a boolean query in place at the moment, shown below:

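// Combine both subqueries in one boolean query.
$query = new Zend_Search_Lucene_Search_Query_Boolean();

// Wildcard subquery: matches terms that contain $string as a substring.
$pattern = new Zend_Search_Lucene_Index_Term("*$string*");
$subquery1 = new Zend_Search_Lucene_Search_Query_Wildcard($pattern);

// Fuzzy subquery: matches terms that are similar to $string.
$term = new Zend_Search_Lucene_Index_Term($string);
$subquery2 = new Zend_Search_Lucene_Search_Query_Fuzzy($term);

// A null sign marks a subquery as optional (neither required nor prohibited).
$query->addSubquery($subquery1, null);
$query->addSubquery($subquery2, null);

$hits = $index->find($query);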

This seems to be executing an either/or search. For example: if I search for the term

"berry"

I hit everything with "berry" anywhere in the title

berry, wild berry, strawberry, blueberry

But if I search for

"bery"

I only hit results like

berry

I'm not exactly sure how the fuzzy search is powered. Is there a way to modify my query so that I can wildcard search after the fuzzy search returns the similar terms?

Answer

I suspect that the field is not analyzed (tokenized) when it is indexed, so each title is stored as a single term.

So, with the first query, you are getting hits from the wildcard query. *berry* matches all of the examples you've given. *bery* doesn't match any of the documents, though, since it's not actually a substring of any of them.

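For the fuzzy query, terms are compared by edit distance (Damerau–Levenshtein distance). In Java Lucene, an edit distance of two is the default maximum for a match. Zend_Search_Lucene_Search_Query_Fuzzy instead takes a minimum similarity (0.5 by default), but the effect is the same: terms that differ too much from the query term are rejected.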

bery to berry - edit distance: 1
bery to wild berry - edit distance: 6
bery to strawberry - edit distance: 6
bery to blueberry - edit distance: 5
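
The distances above can be checked with a short script. This is only a sketch: it uses PHP's built-in levenshtein(), which is plain Levenshtein distance rather than Damerau–Levenshtein, and it is not how Zend_Search_Lucene scores terms internally. For these four pairs the two measures agree.

```php
<?php
// Query term and the indexed titles from the question.
$query  = 'bery';
$titles = array('berry', 'wild berry', 'strawberry', 'blueberry');

// levenshtein() counts insertions, deletions and substitutions.
// It does not treat a transposition as a single edit, unlike
// Damerau-Levenshtein, but the results coincide for these pairs.
$distances = array();
foreach ($titles as $title) {
    $distances[$title] = levenshtein($query, $title);
    echo $query, ' to ', $title, ' - edit distance: ', $distances[$title], PHP_EOL;
}
```

Only berry falls within two edits, which is why it is the only fuzzy hit when each title is indexed as one term.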

This could be handled in part by using an analyzer, instead of indexing the entire string as a single token. A standard analyzer would split wild berry into the tokens wild and berry, and the fuzzy query could then be expected to match the berry token.
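
To illustrate why tokenizing helps, here is a rough model of the idea in plain PHP. The function name fuzzyMatchesAnyToken and the whitespace-only tokenizer are assumptions made for this sketch; a real analyzer does more than split on spaces. In Zend_Search_Lucene itself, fields created with Zend_Search_Lucene_Field::Text() are tokenized, while Zend_Search_Lucene_Field::Keyword() fields are indexed as a single term.

```php
<?php
// Hypothetical helper: true if any whitespace-separated token of $title
// is within $maxEdits edits of $query. This only models the tokenizing
// step of an analyzer, not its other filters.
function fuzzyMatchesAnyToken($query, $title, $maxEdits = 2)
{
    $tokens = preg_split('/\s+/', strtolower(trim($title)));
    foreach ($tokens as $token) {
        if (levenshtein(strtolower($query), $token) <= $maxEdits) {
            return true;
        }
    }
    return false;
}

var_dump(fuzzyMatchesAnyToken('bery', 'wild berry'));  // token "berry" is 1 edit away
var_dump(fuzzyMatchesAnyToken('bery', 'strawberry'));  // single token, 6 edits away
```

With tokens, wild berry becomes reachable, but strawberry and blueberry still are not, because each is a single token.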

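As for strawberry and blueberry, a fuzzy match will not reach them unless your analyzer somehow splits straw and berry apart. You could manually specify which terms to split by incorporating a SynonymFilter into your analyzer. Note that SynonymFilter is a Java Lucene class; with Zend_Search_Lucene you would likely need a custom token filter to do the same job.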

Another option would be to attempt to correct the query spelling before searching, using Lucene's SpellChecker.
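
The spelling-correction approach comes closest to what the question asks for: find the similar term first, then run the wildcard search with it. Below is a minimal sketch of that flow in plain PHP. It does not use Lucene's SpellChecker (a Java class); correctSpelling, the $vocabulary list and the in-memory $titles array are all made up for illustration, and strpos() stands in for the *term* wildcard query.

```php
<?php
// Hypothetical helper: return the vocabulary term closest to $query,
// or $query itself if no term is within $maxEdits edits.
function correctSpelling($query, array $vocabulary, $maxEdits = 2)
{
    $best = $query;
    $bestDistance = $maxEdits + 1;
    foreach ($vocabulary as $term) {
        $distance = levenshtein($query, $term);
        if ($distance < $bestDistance) {
            $best = $term;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Assumed data: a small list of known-good terms and the indexed titles.
$vocabulary = array('berry', 'apple', 'cherry');
$titles     = array('berry', 'wild berry', 'strawberry', 'blueberry', 'apple pie');

$corrected = correctSpelling('bery', $vocabulary);

// Stand-in for the *term* wildcard: keep titles containing the corrected term.
$hits = array();
foreach ($titles as $title) {
    if (strpos($title, $corrected) !== false) {
        $hits[] = $title;
    }
}
print_r($hits);
```

In the original query, the corrected term would take the place of $string when building the Zend_Search_Lucene_Search_Query_Wildcard pattern.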

License: CC-BY-SA with attribution

Not affiliated with StackOverflow