How exactly does trigram word-similarity work?

https://dba.stackexchange.com/questions/184716

08-10-2020
|

Domanda

The docs on the word_similarity function say:

Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the first string is identical to one of the words of the second string).

But they are a bit vague about what the result should be if the 'first string' contains more than one word. What is the explanation for the following results:

select word_similarity('foo bar','foo bar baz');

| word_similarity |
| :-------------- |
| 1               |

select word_similarity('baz bar','foo bar baz');

| word_similarity |
| :-------------- |
| 1               |

select word_similarity('baz foo','foo bar baz');

| word_similarity |
| :-------------- |
| 0.8             |

dbfiddle here

Soluzione

It doesn't work like that, from the a mailing list thread in 2017-2018

We searching for continuous extent of second string trigrams (in original orders) which has best similarity with first string trigrams. Possible solution could be forcing this extent boundaries to be at word boundaries. However, it would become less convenient to search for part of word. And we already have users adopt this feature.

The solution was to release pg_trgm 1.4 with new features which should drop with PostgreSQL 11. It will provide strict_word_similarity which will force extent boundaries to be at word boundaries. In addition there will be two new operators <<<-> and <<% which will do strict-word-extent operations on the index.

For more information see the docs on the development version of pg_trgm

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a dba.stackexchange