Question

I am just learning about the Metaphone and Double Metaphone search algorithms, and I have a few questions. Via the Metaphone Wiki page, I found a couple of sources with implementations, a MySQL implementation in particular. I wanted to try it out on a test database of mine, so I first imported the metaphone.sql file (containing the double metaphone function) found here

Right now, I have a table, country, that has a list of all countries in the 'name' column, e.g. 'Afghanistan', 'Albania', 'Algeria', etc. So, first, I wanted to actually create a new column in the table to store the Double Metaphone string of each country. I ran the following code:

UPDATE country SET NameDM = dm(name)

Everything worked correctly. Afghanistan's metaphone string is 'AFKNSTN', Albania's is 'ALPN', Algeria's is 'ALKR;ALJR', etc. "Awesome," I thought.

However, when I tried to query the table, I got no results. Per the author of metaphone.sql, I adhered to the syntax of the following SQL statement:

SELECT Name FROM tblPeople WHERE dm(Name) = dm(@search)

So, I changed this code to the following:

SELECT * FROM country WHERE dm(name) = dm(@search)

Of course, I changed "@search" to whatever search term I was looking for, but I got 0 results after each and every SQL query.

Could anyone explain this issue? Am I missing something important, or am I just plain misunderstanding the Metaphone algorithm?

Thank you!


Solution

Take a close look at the collation, character set, and encoding (each can be defined down to the column level). A collation defines how strings are compared, and each character set has a default collation. Your literal string may have a different character set from the column, causing the string comparison to fail.

Even this query may be revealing:

select name, length(name), char_length(name), @search, length(@search), char_length(@search) from tbl

Also check the character set variables:

show variables like 'character%'

and the table definition:

show create table tbl
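
The character set and collation on both sides of the comparison can also be inspected directly. The following is a minimal sketch, assuming the `country` table from the question and a user variable `@search` holding the search term; `CHARSET()`, `COLLATION()`, `LENGTH()` and `CHAR_LENGTH()` are built-in MySQL functions, while the column aliases are made up for this example.

```sql
-- Assumption: the question's `country` table and a user variable @search.
SET @search = 'Albania';

-- Character set and collation of the column value versus the search value.
SELECT CHARSET(name)      AS name_charset,
       COLLATION(name)    AS name_collation,
       CHARSET(@search)   AS search_charset,
       COLLATION(@search) AS search_collation
FROM country
LIMIT 1;

-- LENGTH() counts bytes and CHAR_LENGTH() counts characters, so any row
-- where the two differ contains multi-byte characters.
SELECT name, LENGTH(name) AS bytes, CHAR_LENGTH(name) AS chars
FROM country
WHERE LENGTH(name) <> CHAR_LENGTH(name);
```

If the two sides report different character sets or collations, that mismatch is a plausible place to start looking.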

Other tips

When comparing dm() outputs, I use the following function to allow a further level of fuzziness. A direct equality check fails for a significant number of names, e.g. dm('smith') != dm('schmitt'), including common misspellings of my own name.

The function returns a match weighting between 0.0 and 1.0 (I hope), which lets me rank each returned row and select the useful ones. A threshold of 0.3 is quite good for capturing odd pronunciations; 0.5 is more usual.

e.g. dmcompare(dm("boothroyd"), dm("boofreed")) = 0.3
dmcompare(dm("smith"), dm("schmitt")) = 0.5

Notice that this is a comparison of double metaphone strings and not of the original strings. This is for performance reasons: my DB contains a column for the metaphone as well as the original string.

    CREATE FUNCTION `dmcompare`(leftValue VARCHAR(55), rightValue VARCHAR(55)) 
        RETURNS DECIMAL(2,1) 
    NO SQL
    BEGIN
    -- -------------------------------------------------------------------------------------
    -- Compare two (double) metaphone strings for potential similarity, i.e.
    --    dm("smith") != dm("schmitt")  :: "SM0;XMT" != "XMT;SMT"
    --  dmcompare( dm('smith'), dm('schmitt') ) returns 0.5
    -- @author: P.Boothroyd
    -- @version: 0.9, 08/01/2013
    -- The values here can still be played with
    -- (c) GNU P L - feel free to share and adapt, but please acknowledge the original code
    -- -------------------------------------------------------------------------------------
        DECLARE leftPri, leftSec, rightPri, rightSec VARCHAR(55) DEFAULT '';
        DECLARE sepPos INT;
        DECLARE retValue DECIMAL(2,1);
        DECLARE partMatch BOOLEAN;

        -- Extract the metaphone tags
        SET sepPos = LOCATE(";", leftValue);
        IF sepPos = 0 THEN
            SET sepPos = LENGTH(leftValue) + 1;
        END IF;
        SET leftPri = LEFT(leftValue, sepPos - 1);
        SET leftSec = MID(leftValue, sepPos + 1, LENGTH( leftValue ) - sepPos);

        SET sepPos = LOCATE(";", rightValue);
        IF sepPos = 0 THEN
            SET sepPos = LENGTH(rightValue) + 1;
        END IF;
        SET rightPri = LEFT(rightValue, sepPos - 1);
        SET rightSec = MID(rightValue, sepPos + 1, LENGTH( rightValue ) - sepPos);

        -- Calculate likeness factor
        SET retValue = 0;
        SET partMatch = FALSE;
        -- Primaries equal 50% match
        IF leftPri = rightPri THEN
            SET retValue = retValue + 0.5;
            SET partMatch = TRUE;
        ELSE
            IF SOUNDEX(leftPri) = SOUNDEX(rightPri) THEN
                SET retValue = retValue + 0.3;
                SET partMatch = TRUE;
            END IF;
        END IF;
        -- Left secondary vs. right primary: an exact match adds 0.3, and the nested
        -- SOUNDEX check (always true for equal codes) adds a further 0.2
        IF leftSec = rightPri THEN
            SET retValue = retValue + 0.3;
            SET partMatch = TRUE;
            IF SOUNDEX(leftSec) = SOUNDEX(rightPri) THEN
                SET retValue = retValue + 0.2;
                SET partMatch = TRUE;
            END IF;
        END IF;
        -- Left primary vs. right secondary: an exact match adds 0.3, and the nested
        -- SOUNDEX check (always true for equal codes) adds a further 0.2
        IF leftPri = rightSec THEN
            SET retValue = retValue + 0.3;
            SET partMatch = TRUE;
            IF SOUNDEX(leftPri) = SOUNDEX(rightSec) THEN
                SET retValue = retValue + 0.2;
                SET partMatch = TRUE;
            END IF;
        END IF;
        -- Are secondary values the same or both empty
        IF leftSec = rightSec THEN
            -- No secondaries ...
            IF leftSec = '' THEN
                -- If there is prior matching then no secondaries is 40%
                IF partMatch = TRUE THEN
                    SET retValue = retValue + 0.4;
                END IF;
            ELSE
                -- If the secondaries match then 50% match
                SET retValue = retValue + 0.5;
            END IF;
        ELSE
            IF SOUNDEX(leftSec) = SOUNDEX(rightSec) THEN
                IF leftSec = '' THEN
                    IF partMatch = TRUE THEN
                        SET retValue = retValue + 0.3;
                    END IF;
                END IF;
            END IF; 
        END IF;
        RETURN (retValue);
    END
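
A usage sketch for ranking rows with this function, assuming dm() from metaphone.sql, dmcompare() as defined above, and the `country` table from the question with its precomputed NameDM column. The misspelled search term and the `score` alias are made up for this example, and the 0.3 threshold is simply the value suggested earlier.

```sql
-- Assumptions: dm() and dmcompare() exist, and country.NameDM already
-- holds dm(name) for every row.
SET @search_dm = dm('Albenia');  -- hypothetical misspelled search term

SELECT name,
       NameDM,
       dmcompare(NameDM, @search_dm) AS score
FROM country
WHERE dmcompare(NameDM, @search_dm) >= 0.3  -- looser threshold from the text
ORDER BY score DESC;
```

Encoding the search term once into a variable keeps dm() out of the per-row work; only dmcompare() runs for each row.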

Please feel free to use the code, but please also credit the source, P. Boothroyd, in any usage, e.g. when changing values.

Cheers, Paul

SELECT * FROM country WHERE NameDM = dm(@search)

Is probably what you want in the end, so you aren't computing the DM for every country every time you do a search. What you had looks like it should have worked, though. You can troubleshoot by doing:

SELECT dm('Albania')

... should get you ALPN. Now what do you get for...

SELECT * FROM country WHERE NameDM = 'ALPN'

?
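
The stored-column approach above can be sketched as follows. This assumes dm() from metaphone.sql, the `country` table from the question, and that NameDM is a short VARCHAR column already filled by `UPDATE country SET NameDM = dm(name)`; the index name is made up for this example.

```sql
-- Assumption: NameDM is a short VARCHAR, so it can be indexed without a
-- prefix length. The index name is hypothetical.
CREATE INDEX idx_country_namedm ON country (NameDM);

SET @search = 'Albania';
SET @search_dm = dm(@search);   -- encode the search term once

-- Compare the stored codes against a constant; dm() is not called per row.
SELECT * FROM country WHERE NameDM = @search_dm;

-- If that returns no rows, check that both variables are set and non-empty.
SELECT @search, @search_dm, CHAR_LENGTH(@search_dm);
```

An unset user variable is NULL, and a comparison against NULL matches no rows, so the last query is a cheap way to rule that out.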

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow