Question

I'm building a search engine for a django/python site. One requirement is a soundex feature, so that if someone searches for "smith" or "johnson" the search will return homonyms like "smyth" or "jonsen". The database is MySQL, FWIW.

What's recommended as a good approach? Right now I'm leaning towards something like Haystack + Whoosh, with something to capture the soundex feature.

Thanks in advance for your help.

Was it helpful?

Solution

MySQL has a soundex() function. Docs are here. But the soundex algorithm was originally developed to aid in searching for Anglo-Saxon names in English. It's probably not the best choice these days.

You're probably better off with either metaphone or double metaphone.

In any case, most people store the result. That makes it easy to index, and searching is usually pretty fast.

Data integrity is a problem, though. Ideally, I'd want to do something like this.

create table persons (
  ...
  last_name varchar(25) not null,
  last_name_phonetic varchar(6) not null,  -- Not sure about the length
  check (last_name_phonetic = double_metaphone(last_name))
  ...
);

But that requires your dbms to have either an intrinsic double_metaphone() function, or support user-defined functions in CHECK() constraints. MySQL doesn't enforce CHECK() constraints at all, so you'd need to implement this in triggers if your application needs this kind of data integrity.

For what it's worth, PostgreSQL has a contrib module, fuzzystrmatch, that implements soundex, metaphone, double metaphone, and Levenshtein distance functions. If it were up to me, I'd build this in PostgreSQL rather than MySQL.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top