Pergunta

We have developed a web based application for name matching. It operates by breaking names into parts and the Soundex value of each part is stored in a database. The Levenshtein distance metric is used to apply percentage matching of sound as well as spelling against a given name.

At runtime, we load all records into memory and apply the Levenshtein distance to all of the Soundex values and the spelling of all the parts of all the names.

This was working fine at first because there were at maximum 20 thousand names, but now one of our clients has 30 million names. Loading this huge list in memory for each request and applying this type of matching is a pathetic approach, using a lot of memory and execution time.

We are looking for suggestions to search database of 30 million records or more in near future with percentage matching of Sound and Spelling.

Core Functionality

End user enters the name to be matched and minimum percentage. We are supposed to show all those names in database for which any part of the name matches with any part of the given name upto the given percentage. Full name is not required to be matched, any part if matches upto the percentage is a success. For example.

Given Name: Helen Hunt
Name in DB: Holly Hunter 

Both parts of both names are not matching exactly but upto some extent, let us assume 80%, so if user enter 80% then the name in DB must be shown as matching name.

Foi útil?

Solução

Without knowing the full details of what you need, you probably want to do one of the following:

I don't fully know what's involved installing and configuration sphinx; but, I'm under the impression you can point it at a database, tell it which fields to index, how to weight the results, and it'll give you an ordered list of matching records back.

For user-facing or mission critical stuff, use an existing search tool.

If you're just feeling academic ... Play with ngrams:

An ngrams lookup table can serve as your initial set of potential matches, and you can use Levenshtein distances to prune and sort the results.

Assuming you want to search people, you might do something like:

_ people _________
personId: int
name: varchar
soundex_name: varchar

_ people_ngrams __
personId: int
ngramId: int

_ ngrams _________
ngramId: int
ngram: char(3)
count: int

You can either periodically rebuild your ngrams or build them on-the-fly. Either way, a simple, naive search algorithm can look like this:

search_ngrams = ngrammify(soundex(search_string));

notable_ngrams = select top 10 *
  from ngrams
  where ngram in (search_ngrams)
  order by count asc;

possible_matches = select top 1000 distinct people.*
  from people_ngrams, people
  where ngramId in (notable_ngrams);

best_matches = top 100 possible_matches
  ordered by Levenshtein_distance(match, soundex(search_string));

Using something pretty similar to this (but with a little more ngram "popularity" tuning, blacklists, whitelists, etc.), I've seen this sort of algorithm fuzzily merge records between data sets in bulk, as well as facilitate custom fuzzy search utilities and ongoing records de-duplication efforts.

Now, in my case, I wasn't matching millions of records, I was looking to select the best possible merges between a two data sets on the order of hundreds of thousands of records each. And, we wanted it to work fairly quickly -- within a few minutes. (Quick, what's 100,000 * 100,000?) And, we were successful.

So, with the right tuning, this sort of thing can be snappy and effective. We were ultimately able produce a merged set on a humble, dated, dual-core machine in a few minutes, with "questionable" merges automatically flagged for manual review. But, it took a lot of time to find the ngram popularity/relevance sweet-spot, and the right string-distance thresholds, and blacklists, and whitelists ... etc.

THAT SAID, you can really get sucked into a hole working on this stuff. For any real-world production-level stuff, you should generally use a well-established tool that's already made and optimized for this kind of searching.

Like Sphinx or Lucene.

Licenciado em: CC-BY-SA com atribuição
scroll top