Why isn't this MySQL double metaphone function working correctly?
Question
I am just learning about the Metaphone and Double Metaphone search algorithms, and I have a few questions. Per the Metaphone Wiki page, I found a couple sources with implementations, a MySQL implementation in particular. I wanted to test it out with a test database of mine so I first imported the metaphone.sql file (containing the double metaphone function) found here
Right now, I have a table, country, that has a list of all countries in the 'name' column, e.g. 'Afghanistan', 'Albania', 'Algeria', etc. So, first, I wanted to actually create a new column in the table to store the Double Metaphone string of each country. I ran the following code:
UPDATE country SET NameDM = dm(name)
Everything worked correctly. Afghanistan's metaphone string is 'AFKNSTN', Albania's is 'ALPN', Algeria's is 'ALKR;ALJR', etc. "Awesome," I thought.
However, when I tried to query the table, I got no results. Per the author of metaphone.sql, I adhered to the syntax of the following SQL statement:
SELECT Name FROM tblPeople WHERE dm(Name) = dm(@search)
So, I changed this code to the following:
SELECT * FROM country WHERE dm(name) = dm(@search)
Of course, I changed "@search" to whatever search term I was looking for, but I got 0 results after each and every SQL query.
Could anyone explain this issue? Am I missing something important, or am I just plain misunderstanding the Metaphone algorithm?
Thank you!
Solution
Take a close look at the collation, character set, and encoding (these can be defined down to the column level). A collation defines how strings are compared, and each character set has a default collation that applies unless you override it. Your literal string may have a different character set or collation from the column, causing the string comparison to fail.
Even these queries may be revealing:
select name, length(name), char_length(name), @search, length(@search), char_length(@search) from tbl;
show variables like 'character%';
show create table tbl;
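To check the character-set theory directly, a sketch along these lines may help. It assumes the `country` table from the question and the `dm()` function from metaphone.sql; `'Albania'` is only a sample search term, and `utf8mb4` with `utf8mb4_general_ci` is assumed to be available on your server.

```sql
-- Sample search term (an assumption; use your own).
SET @search = 'Albania';

-- Which character set and collation does each side of the comparison carry?
SELECT CHARSET(dm(name))      AS col_charset,
       COLLATION(dm(name))    AS col_collation,
       CHARSET(dm(@search))   AS search_charset,
       COLLATION(dm(@search)) AS search_collation
FROM country
LIMIT 1;

-- Retry the comparison with both sides forced to one character set and collation.
SELECT *
FROM country
WHERE CONVERT(dm(name) USING utf8mb4) COLLATE utf8mb4_general_ci
    = CONVERT(dm(@search) USING utf8mb4) COLLATE utf8mb4_general_ci;
```

If the second query returns rows while the original one does not, the mismatch is probably in the character set or collation rather than in the Metaphone function itself.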
Other tips
When comparing dm() outputs, I use the following function to allow a further level of fuzziness. A direct equality check fails for a significant number of names, including common misspellings of my own: for example, dm('smith') != dm('schmitt').
The function returns a match weighting between 0.0 and 1.0 (I hope), which lets me rank each returned row and select the useful ones. A threshold of 0.3 is quite good for capturing odd pronunciations; 0.5 is more usual.
For example:
dmcompare(dm("boothroyd"), dm("boofreed")) = 0.3
dmcompare(dm("smith"), dm("scmitt")) = 0.5
Notice that this compares double metaphone strings, not the original strings. This is for performance: my DB has a column for the metaphone as well as one for the original string.
CREATE FUNCTION `dmcompare`(leftValue VARCHAR(55), rightValue VARCHAR(55))
RETURNS DECIMAL(2,1)
NO SQL
BEGIN
  -- -------------------------------------------------------------------------------------
  -- Compare two (double) metaphone strings for potential similarity, i.e.
  --   dm("smith") != dm("schmitt") :: "SM0;XMT" != "XMT;SMT"
  --   dmcompare(dm('smith'), dm('schmitt')) returns 0.5
  -- @author: P.Boothroyd
  -- @version: 0.9, 08/01/2013
  -- The values here can still be played with
  -- (c) GNU P L - feel free to share and adapt, but please acknowledge the original code
  -- -------------------------------------------------------------------------------------
  -- Note: MySQL only treats "--" as a comment when it is followed by whitespace,
  -- so the ruler lines above start with "-- " rather than being one run of dashes.
  DECLARE leftPri, leftSec, rightPri, rightSec VARCHAR(55) DEFAULT '';
  DECLARE sepPos INT;
  DECLARE retValue DECIMAL(2,1);
  DECLARE partMatch BOOLEAN;

  -- Extract the metaphone tags (primary before the ";", secondary after it)
  SET sepPos = LOCATE(";", leftValue);
  IF sepPos = 0 THEN
    SET sepPos = LENGTH(leftValue) + 1;
  END IF;
  SET leftPri = LEFT(leftValue, sepPos - 1);
  SET leftSec = MID(leftValue, sepPos + 1, LENGTH(leftValue) - sepPos);

  SET sepPos = LOCATE(";", rightValue);
  IF sepPos = 0 THEN
    SET sepPos = LENGTH(rightValue) + 1;
  END IF;
  SET rightPri = LEFT(rightValue, sepPos - 1);
  SET rightSec = MID(rightValue, sepPos + 1, LENGTH(rightValue) - sepPos);

  -- Calculate likeness factor
  SET retValue = 0;
  SET partMatch = FALSE;

  -- Primaries equal 50% match; primaries that only share a SOUNDEX are worth 30%
  IF leftPri = rightPri THEN
    SET retValue = retValue + 0.5;
    SET partMatch = TRUE;
  ELSE
    IF SOUNDEX(leftPri) = SOUNDEX(rightPri) THEN
      SET retValue = retValue + 0.3;
      SET partMatch = TRUE;
    END IF;
  END IF;

  -- Test alternate primary and secondaries, worth 30% match
  -- (the nested SOUNDEX check also passes on an exact match, adding a further 0.2)
  IF leftSec = rightPri THEN
    SET retValue = retValue + 0.3;
    SET partMatch = TRUE;
    IF SOUNDEX(leftSec) = SOUNDEX(rightPri) THEN
      SET retValue = retValue + 0.2;
      SET partMatch = TRUE;
    END IF;
  END IF;

  -- Test alternate primary and secondaries, worth 30% match
  -- (again, the nested SOUNDEX check adds a further 0.2)
  IF leftPri = rightSec THEN
    SET retValue = retValue + 0.3;
    SET partMatch = TRUE;
    IF SOUNDEX(leftPri) = SOUNDEX(rightSec) THEN
      SET retValue = retValue + 0.2;
      SET partMatch = TRUE;
    END IF;
  END IF;

  -- Are secondary values the same, or both empty
  IF leftSec = rightSec THEN
    -- No secondaries ...
    IF leftSec = '' THEN
      -- If there is prior matching then no secondaries is 40%
      IF partMatch = TRUE THEN
        SET retValue = retValue + 0.4;
      END IF;
    ELSE
      -- If the secondaries match then 50% match
      SET retValue = retValue + 0.5;
    END IF;
  ELSE
    IF SOUNDEX(leftSec) = SOUNDEX(rightSec) THEN
      IF leftSec = '' THEN
        IF partMatch = TRUE THEN
          SET retValue = retValue + 0.3;
        END IF;
      END IF;
    END IF;
  END IF;

  RETURN (retValue);
END
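As a rough usage sketch, the function could be applied to the asker's table like this. It assumes the `country` table with the precomputed `NameDM` column from the question and that `dm()` from metaphone.sql is installed; `'Algeria'` and the `@searchDM` variable are only illustrative, and 0.3 is simply the lower threshold mentioned above.

```sql
-- Sample search term (an assumption; use your own).
SET @search = 'Algeria';
-- Compute the search key once, rather than once per row.
SET @searchDM = dm(@search);

-- Rank countries by how closely their stored metaphone matches the search key.
SELECT name,
       NameDM,
       dmcompare(NameDM, @searchDM) AS score
FROM country
WHERE dmcompare(NameDM, @searchDM) >= 0.3
ORDER BY score DESC;

-- Sanity check on literal metaphone strings taken from the function's header comment.
-- Tracing the function body by hand gives 0.5: the left secondary "XMT" equals
-- the right primary "XMT" (+0.3), and their SOUNDEX values also match (+0.2).
SELECT dmcompare('SM0;XMT', 'XMT;SMT') AS score;
```

Because the comparison runs on the stored `NameDM` column, `dm()` is only evaluated for the search term, which is the performance point made above.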
Please feel free to use the code, but please also credit the source (P.Boothroyd) in any usage, e.g. when changing the values.
Cheers, Paul
SELECT * FROM country WHERE NameDM = dm(@search)
Is probably what you want in the end, so you aren't computing the DM for every country every time you do a search. What you had looks like it should have worked, though. You can troubleshoot by doing:
SELECT dm('Albania')
... should get you ALPN. Now what do you get for...
SELECT * FROM country WHERE NameDM = 'ALPN'
?
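To take the troubleshooting one step further, a side-by-side comparison such as the sketch below might show where the two values diverge. It assumes the `country` table and `NameDM` column from the question and the `dm()` function from metaphone.sql; `'Albania'` is only a sample term, and the column aliases are made up for illustration.

```sql
-- Sample search term (an assumption; use your own).
SET @search = 'Albania';

SELECT name,
       NameDM,
       dm(name)             AS fresh_dm,    -- recomputed from the name column
       dm(@search)          AS search_dm,   -- computed from the search term
       HEX(NameDM)          AS stored_hex,  -- raw bytes of the stored value
       HEX(dm(@search))     AS search_hex,  -- raw bytes of the search key
       NameDM = dm(@search) AS is_match     -- 1 = match, 0 = no match, NULL = a side is NULL
FROM country
WHERE name = 'Albania';
```

If `search_dm` comes back NULL or empty, check that `@search` was actually set in the same session, since user-defined variables are per-session and an unset one is NULL. If the two hex columns differ while the visible strings look the same, stray whitespace or an encoding difference is a likely suspect.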