Question

I have recently been tasked with developing an algorithm for checking for duplicate customer records in a database. The DB layout is pretty straightforward: Tens of thousands of rows with fields like FullName, Street, City, ZIP, Phone, etc...

A little background first:

I have done some extensive research on algorithms, and have decided that every field should be weighted by a certain amount and compared with several different algorithms, since no single algorithm performs equally well under all circumstances. For example, LastName has a weight factor of 0.50. When I evaluate, I choose which algorithms to use and how much each weighs in the final decision (a quick sketch of the combination follows the list):
Factor 0.25: JaroWinkler
Factor 0.60: Cosine 2-Gram Similarity
Factor 0.15: DamerauLevenshtein
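
For illustration, this is roughly how such a weighted mix could look in C++; the three similarity functions are placeholders assumed to return scores in [0, 1], not code from my project:

    #include <string>

    // Placeholder declarations: each is assumed to return a similarity in [0, 1].
    double jaroWinkler(const std::string& a, const std::string& b);
    double cosine2GramSimilarity(const std::string& a, const std::string& b);
    double damerauLevenshtein(const std::string& a, const std::string& b);

    // Per-field score: each algorithm contributes according to its factor
    // (the factors sum to 1.0).
    double lastNameSimilarity(const std::string& a, const std::string& b)
    {
        return 0.25 * jaroWinkler(a, b)
             + 0.60 * cosine2GramSimilarity(a, b)
             + 0.15 * damerauLevenshtein(a, b);
    }

    // The per-field scores are then combined using the field weights
    // (e.g. 0.50 for LastName) to give the overall match probability
    // for a pair of records.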

Everything works well, and with a little tuning I detect the positives with little error. So far so good. However, as you can imagine, a runtime of O(n^2) - or, more precisely, the sum of i for i = 1..n, i.e. roughly n^2/2 pairwise comparisons - is not very effective when dealing with tens of thousands of records. Needless to say, aggressive optimization, compiler optimizations for speed, multithreading, etc., are just band-aids, since the real problem is the complexity.

Essentially, I am looking for a way to pre-filter the potential matches, and have done three days of research on this now. I have found some valuable information on R-Trees, R*-Trees, KD-Trees, Euclidean vectors, minhashing, and so on. However, most of the information about these is rather academic. The most valuable resource I found was in "Mining of Massive Datasets", Chapter 3.

Now to my real question:

I've read all this information, but I'm not sure how to put it all together.

I was thinking about some sort of indexing in a tree or graph data structure where I can put in a string and say "Find me all records that have a probability of matching > 0.20". This lookup should be really fast. Then, when I get a list of potential (>0.20) matches, I could compare those few items with my "expensive", but selective algorithm. That should cut the runtime to a very reasonable value, I believe.
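
To illustrate the intent, here is a rough sketch of that two-stage flow in C++; findCandidates and expensiveSimilarity are hypothetical placeholders for the fast index query and the weighted comparison described above, not existing APIs:

    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical fast index lookup: returns record indices whose match probability
    // against 'key' is estimated to exceed minProbability.
    std::vector<std::size_t> findCandidates(const std::string& key, double minProbability);

    // Hypothetical expensive, selective comparison (the weighted algorithm mix above).
    double expensiveSimilarity(std::size_t recordA, std::size_t recordB);

    // Two-stage check: cheap pre-filter first, expensive scoring only on the survivors.
    std::vector<std::pair<std::size_t, double>>
    scoreLikelyDuplicates(std::size_t record, const std::string& key)
    {
        std::vector<std::pair<std::size_t, double>> scored;
        for (std::size_t candidate : findCandidates(key, 0.20))
            if (candidate != record)
                scored.emplace_back(candidate, expensiveSimilarity(record, candidate));
        return scored;
    }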

I have been trying to find some sort of reference code to do what I want to do above, but I don't seem to come up with anything other than scholarly articles. I did find "simstring", which actually compiled, but it didn't seem to match my 7 test records very well. Could anyone point me in the right direction? Surely someone must have run into this before and found a solution...

Thank you very much in advance!

P.S. I am doing this in C++, but any samples in C#/C/Java/PHP would be fine.


Solution 2

I have finally succeeded in implementing a preselection by doing the following:

1. Use certain fields of the customer record to construct 2-grams.
2. Minhash the 2-grams with a family of 6 minhash functions into a 192-bit signature.
3. Use the boost::geometry library's rtree implementation to create a 6-dimensional spatial index over the signatures.
4. Select the nearest k records (in my case 30) for the record I'm comparing, and run the original "expensive" comparison on those candidates only.
5. This reduces the complexity from the sum of i for i = 1..n (roughly n^2/2 comparisons) to roughly 30n + m, where m is the time it takes to build the index (almost negligible, surprisingly).
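
A minimal sketch of steps 1-4 in C++, assuming Boost.Geometry is available. The hash family (std::hash salted with fixed odd multipliers), the quadratic<16> node parameters, and the helper names are illustrative assumptions, not the exact code I used:

    #include <boost/geometry.hpp>
    #include <boost/geometry/geometries/point.hpp>
    #include <boost/geometry/index/rtree.hpp>

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <iterator>
    #include <limits>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    namespace bg  = boost::geometry;
    namespace bgi = boost::geometry::index;

    // One coordinate per minhash function: 6 x 32-bit values = a 192-bit signature.
    using SigPoint = bg::model::point<double, 6, bg::cs::cartesian>;
    using Value    = std::pair<SigPoint, std::size_t>;   // (signature, record index)
    using Rtree    = bgi::rtree<Value, bgi::quadratic<16>>;

    // Step 1: character 2-grams of a record key (e.g. a normalized name + street string).
    std::set<std::string> twoGrams(const std::string& s)
    {
        std::set<std::string> grams;
        for (std::size_t i = 0; i + 1 < s.size(); ++i)
            grams.insert(s.substr(i, 2));
        return grams;
    }

    // Step 2: minhash the 2-gram set with a family of 6 hash functions. Here the
    // family is std::hash salted with fixed odd multipliers -- a stand-in for
    // whatever hash family you actually choose.
    SigPoint minhashSignature(const std::set<std::string>& grams)
    {
        static const std::uint64_t seeds[6] = {
            0x9E3779B97F4A7C15ULL, 0xC2B2AE3D27D4EB4FULL, 0x165667B19E3779F9ULL,
            0xD6E8FEB86659FD93ULL, 0xA5A5A5A5A5A5A5A5ULL, 0x2545F4914F6CDD1DULL
        };
        double coord[6];
        for (int h = 0; h < 6; ++h) {
            std::uint32_t best = std::numeric_limits<std::uint32_t>::max();
            for (const std::string& g : grams) {
                std::uint64_t v = std::hash<std::string>{}(g) * seeds[h];
                best = std::min(best, static_cast<std::uint32_t>(v >> 32));
            }
            coord[h] = static_cast<double>(best);
        }
        SigPoint p;
        bg::set<0>(p, coord[0]); bg::set<1>(p, coord[1]); bg::set<2>(p, coord[2]);
        bg::set<3>(p, coord[3]); bg::set<4>(p, coord[4]); bg::set<5>(p, coord[5]);
        return p;
    }

    // Step 3: build the 6-dimensional spatial index over all record signatures.
    Rtree buildIndex(const std::vector<std::string>& keys)
    {
        std::vector<Value> values;
        values.reserve(keys.size());
        for (std::size_t i = 0; i < keys.size(); ++i)
            values.emplace_back(minhashSignature(twoGrams(keys[i])), i);
        return Rtree(values.begin(), values.end());   // bulk-loaded (packed)
    }

    // Step 4: the k nearest signatures become the candidates for the expensive comparison.
    std::vector<std::size_t> candidates(const Rtree& index, const std::string& key,
                                        unsigned k = 30)
    {
        std::vector<Value> hits;
        index.query(bgi::nearest(minhashSignature(twoGrams(key)), k),
                    std::back_inserter(hits));
        std::vector<std::size_t> ids;
        ids.reserve(hits.size());
        for (const Value& v : hits)
            ids.push_back(v.second);
        return ids;
    }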

I can now run 15,000 comparisons with high accuracy in 60 seconds, and that is in a single-threaded test. Multithreaded across 4 or 8 cores, it will run even faster.

OTHER TIPS

As a first cut at this, I'd simply select those strings close enough in length that they could match within the given probability. This won't be very selective, but (unless you specify quite loose tolerances) it will probably eliminate a large percentage of impossible matches very quickly. (For example, with an edit metric like Levenshtein that counts an insertion as 1 operation, if you start with a string of length 5 and need to match within 5 operations, you can eliminate all strings longer than 10 without further examination.)
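
For illustration, a minimal sketch of that length pre-filter in C++; the function name and signature are mine, not part of any library:

    #include <string>
    #include <vector>

    // Keep only records whose length could possibly be within maxEdits of the query:
    // a single insertion or deletion changes the length by exactly 1, so any candidate
    // whose length differs by more than maxEdits can never match.
    std::vector<std::size_t> lengthPrefilter(const std::vector<std::string>& records,
                                             const std::string& query,
                                             std::size_t maxEdits)
    {
        std::vector<std::size_t> keep;
        for (std::size_t i = 0; i < records.size(); ++i) {
            std::size_t lenA = records[i].size(), lenB = query.size();
            std::size_t diff = lenA > lenB ? lenA - lenB : lenB - lenA;
            if (diff <= maxEdits)
                keep.push_back(i);
        }
        return keep;
    }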

Whether this will be selective enough to go straight to your expensive comparison is open to question -- obviously that'll depend on the variability of the lengths of strings you're matching against.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow