Multiple synonym dictionary matches in PostgreSQL full text searching
05-07-2019
Question
I am trying to do full text searching in PostgreSQL 8.3. It worked splendidly, so I added in synonym matching (e.g. 'bob' == 'robert') using a synonym dictionary. That works great too. But I've noticed that it apparently only allows a word to have one synonym. That is, 'al' cannot be 'albert' and 'allen'.
Is this correct? Is there any way to have multiple dictionary matches in a PostgreSQL synonym dictionary?
For reference, here is my sample dictionary file:
bob robert
bobby robert
al alan
al albert
al allen
And the SQL that creates the full text search config:
CREATE TEXT SEARCH DICTIONARY nickname (TEMPLATE = synonym, SYNONYMS = nickname);
CREATE TEXT SEARCH CONFIGURATION dxp_name (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION dxp_name ALTER MAPPING FOR asciiword WITH nickname, simple;
What am I doing wrong? Thanks!
Solution
That's a limitation in how the synonyms work. What you can do is turn it around as in:
bob robert
bobby robert
alan al
albert al
allen al
It should give the same end result, which is that a search for either one of those will match the same thing.
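One way to check the reversed file is to run a few matches through the configuration from the question. This is only a minimal sketch: it assumes the synonym file has been replaced with the entries above and that the queries run in a new session so the file is re-read. The sample names are made up for illustration.

```sql
-- Both sides are normalized to the lexeme 'al', so each of these
-- is expected to return true.
SELECT to_tsvector('dxp_name', 'albert smith') @@ to_tsquery('dxp_name', 'al');
SELECT to_tsvector('dxp_name', 'al smith') @@ to_tsquery('dxp_name', 'allen');

-- Side effect of collapsing every variant to 'al':
-- a search for albert is also expected to match alan.
SELECT to_tsvector('dxp_name', 'alan smith') @@ to_tsquery('dxp_name', 'albert');
```

If that cross-matching between alan, albert and allen is unwanted, the reversed file is probably not the right fit.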
OTHER TIPS
A dictionary must define a functional relationship between words and lexemes; otherwise it won't know which lexeme to return when you lexize. In your example, al maps to three different values, which makes the mapping multi-valued, so the lexize function doesn't know what to return. As Magnus shows, you can instead lexize from the proper names alan, albert, and allen to the nickname al.
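The mapping can be inspected directly with ts_lexize, which runs a single token through one dictionary. The sketch below assumes the nickname dictionary from the question has been loaded from the reversed file (proper name first, nickname second); the token smith is just an illustrative word with no entry.

```sql
-- Each proper name maps to exactly one lexeme, so the lookup is unambiguous.
SELECT ts_lexize('nickname', 'albert');
SELECT ts_lexize('nickname', 'allen');

-- A word with no entry is expected to return NULL, which lets the next
-- dictionary in the mapping (simple) handle it.
SELECT ts_lexize('nickname', 'smith');
```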
Remember, however, that the point of an FTS dictionary is not to perform transformations per se but to allow efficient indexing on semantically relevant words. This means that the lexeme need not resemble the original entry in any linguistic sense. Although you're right that a many-to-many relationship is impossible to define, do you really need one? For example, to resolve your vin example:
vin vin
vincent vin
vincenzo vin
vinnie vin
but you could also do this:
vin grob
vincent grob
vincenzo grob
vinnie grob
and get the same effect (although why you'd want to is another story).
Thus, if you were to parse a document containing, say, 11 variants of the name Vincent, the to_tsvector function would return a single lexeme with 11 positions: vin in the former case and grob in the latter.
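To see this on a small scale, the first vin file can be wired into its own configuration in the same way as in the question. The names vin_names and vin_cfg are hypothetical, and the sketch assumes the four vin entries are saved as vin_names.syn in the server's tsearch_data directory.

```sql
-- Hypothetical dictionary and configuration, mirroring the question's setup.
CREATE TEXT SEARCH DICTIONARY vin_names (TEMPLATE = synonym, SYNONYMS = vin_names);
CREATE TEXT SEARCH CONFIGURATION vin_cfg (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION vin_cfg ALTER MAPPING FOR asciiword WITH vin_names, simple;

-- All four variants should collapse into one lexeme that carries
-- one position per occurrence, along the lines of 'vin':1,2,3,4
SELECT to_tsvector('vin_cfg', 'vin vincent vincenzo vinnie');
```

Swapping vin for grob in the file should change only the lexeme text, not the positions.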
The 8.4 documentation describes dict_xsyn, an extended synonym dictionary that replaces a word with a group of its synonyms. Maybe that will be helpful?
http://www.postgresql.org/docs/8.4/interactive/dict-xsyn.html
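As a rough sketch of what that page describes: each line of a dict_xsyn rules file lists a word followed by its synonyms. The following assumes the contrib module is installed (its install script creates a dictionary named xsyn) and that a hypothetical file nickname.rules in the server's tsearch_data directory contains the line "al alan albert allen". The file name and its contents are assumptions for illustration, not something from the question.

```sql
-- Point the xsyn dictionary at the hypothetical rules file and
-- drop the original word from the output.
ALTER TEXT SEARCH DICTIONARY xsyn (RULES = 'nickname', KEEPORIG = false);

-- Should return the synonym group for al rather than a single lexeme.
SELECT ts_lexize('xsyn', 'al');
```

How a multi-lexeme result interacts with to_tsquery is worth testing on your own data before relying on it.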