“Did you mean” feature on a dictionary database
-
02-10-2019 - |
Question
I have a ~300.000 row table; which includes technical terms; queried using PHP and MySQL + FULLTEXT indexes. But when I searching a wrong typed term; for example "hyperpext"; naturally giving no results.
I need to "compansate" little writing errors and getting nearest record from database. How I can accomplish such feaure? I know (actually, learned today) about Levenshtein distance, Soundex and Metaphone algorithms but currently not having a solid idea to implement this to querying against database.
Best regards. (Sorry about my poor English, I'm trying to do my best)
Solution
See this article for how you might implement Levenshtein distance in a MySQL stored function.
For posterity, the author's suggestion is to do this:
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END
He also supplies a LEVENSHTEIN_RATIO helper method which will evaluate the ratio of different/total characters, rather than a straight edit distance. For instance, if it's 60%, then three-fifths of the characters in the source word are different from the destination word.
CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END
OTHER TIPS
From the comments of http://dev.mysql.com/doc/refman/5.0/en/udf-compiling.html
now i download the package from the mysql udf repository http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/
wget http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/udf/dludf.cgi?ckey=28
ll
tar -xzvf dludf.cgi\?ckey\=28
gcc -shared -o libmysqllevenshtein.so mysqllevenshtein.cc -I/usr/include/mysql/
mv libmysqllevenshtein.so /usr/lib
mysql -uroot -pPASS
mysql> use DATABASE
mysql> CREATE FUNCTION levenshtein RETURNS INT SONAME 'libmysqllevenshtein.so';
mysql> select levenshtein(w1.word,w2.word) as dist from word w1, word w2 where ETC........... order by dist asc limit 0,10;
I suggest that you generate typo variations on the query input.
i.e. hyperpext > { hyperpeext, hipertext, ... } etc
One of these is bound to be the correct spelling (especially for common misspellings)
The way you identify the most likely match is to do a lookup for each on an index which tells you the document frequency of the term. (make sense?)
Why not add a table column for storing the word in its alternate (e.g., Soundex) form? that way, if your first SELECT does not find the exact match, you can do a second search to look for matching alternate forms.
The trick is to encode each word so that misspelled variations end up converted into the same alternate form.