Analyzer to autocomplete names

https://stackoverflow.com/questions/17017216

31-05-2022

Question

I want to be able to autocomplete names.

For example, if we have the name John Smith, I want to be able to search for Jo, Sm, or John Sm and get the document back.

In addition, I do not want jo sm to match the document.

I currently have this analyzer:

return array(
    'settings' => array(
        'index' => array(
            'analysis' => array(
                'analyzer' => array(
                    'autocomplete' => array(
                        'tokenizer' => 'autocompleteEngram',
                        'filter' => array('lowercase', 'whitespace')
                    )
                ),

                'tokenizer' => array(
                    'autocompleteEngram' => array(
                        'type' => 'edgeNGram',
                        'min_gram' => 1,
                        'max_gram' => 50
                    )
                )
            )   
        )
    )
);

The problem with this is that the text is first split into words, and each word is then tokenized into edge n-grams.

This produces the following tokens: j jo joh john s sm smi smit smith

This means that if I search for john smith or john sm, nothing is returned.

So I need to generate tokens that look like this: j jo joh john s sm smi smit smith john s john sm john smi john smit john smith.

How can I set up my analyzer so that it generates those extra tokens?
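
The token list described in the question can be reproduced with a rough PHP sketch. It only imitates the described behaviour (split on whitespace, lowercase, take edge n-grams of each word) and is not Elasticsearch's implementation; the function name edgeNgramsPerWord is hypothetical.

```php
<?php
// Illustrative only: split on whitespace, lowercase, then emit the edge
// n-grams (prefixes) of each word separately.
function edgeNgramsPerWord(string $text, int $minGram = 1, int $maxGram = 50): array
{
    $tokens = array();
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Prefix lengths run from min_gram up to the word length (capped at max_gram).
        $limit = min(strlen($word), $maxGram);
        for ($len = $minGram; $len <= $limit; $len++) {
            $tokens[] = substr($word, 0, $len);
        }
    }
    return $tokens;
}

$tokens = edgeNgramsPerWord('John Smith');
echo implode(' ', $tokens), PHP_EOL;

// No token spans both words, so a term such as "john sm" is never indexed.
var_dump(in_array('john sm', $tokens, true));
```

Because every token is a prefix of a single word, none of them contains a space, which is why the multi-word tokens asked for above are missing.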

Solution

I ended up not using edgengrams.

I created an analyzer with the standard tokenizer and the standard and lowercase filters. This is virtually identical to the standard analyzer, but it does not have a stopwords filter (we are searching for names after all, and someone might be called The or An).

I then set the above analyzer as the index_analyzer and simple as the search_analyzer. Using this setup with a match_phrase_prefix query worked really well.

This is the custom analyzer I used (called autocomplete and expressed in PHP):

'autocomplete' => array(
    'tokenizer' => 'standard',
    'filter' => array('standard', 'lowercase')
),
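
Putting the pieces of this answer together, the whole setup might look roughly like the sketch below. The type name person, the field name name, and the three variable names are hypothetical. The string field type and the index_analyzer key assume an Elasticsearch version from the era of this answer; as far as I know, later versions replaced index_analyzer with analyzer, replaced string with text, and dropped the standard token filter, so treat this as an outline rather than a drop-in config.

```php
<?php
// Index settings: the custom "autocomplete" analyzer from the answer.
$settings = array(
    'settings' => array(
        'index' => array(
            'analysis' => array(
                'analyzer' => array(
                    'autocomplete' => array(
                        'tokenizer' => 'standard',
                        'filter' => array('standard', 'lowercase'),
                    ),
                ),
            ),
        ),
    ),
);

// Mapping: "person" and "name" are hypothetical names chosen for illustration.
$mapping = array(
    'person' => array(
        'properties' => array(
            'name' => array(
                'type' => 'string',
                'index_analyzer' => 'autocomplete', // used when indexing documents
                'search_analyzer' => 'simple',      // used on the query text
            ),
        ),
    ),
);

// Query body: every term except the last must match a whole token,
// and the last term is treated as a prefix.
$query = array(
    'query' => array(
        'match_phrase_prefix' => array(
            'name' => 'john sm',
        ),
    ),
);
```

With this kind of query, jo, sm, and john sm should each match John Smith, because the last term is treated as a prefix. jo sm should not match, because jo would have to match a complete token.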

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow