Pergunta

I'm developing an application which must be able to find & merge duplicates in a Hundreds of thousands of contact information stored in sql server DB. I have to compare all the columns in the table, each column has a weight value. The comparison must work based on the weight value. Based on the comparison result & degree of equivalence i have to decide to merge the contacts automatically or request user attention. I know that there are number of fuzzy logic algorithms for deduplication.

Read about N-gram or Q-gram-based Algorithms in http://www.melissadata.com/. Is this algorithm feasible for a large set of data? If not can any one guide me with some algorithm or tel me where to start with?

An example of what i want to achieve,

Gonzales = Gonzalez (two different spelling of different name)
Smith = Smyth (Phonetic sound the same)
123 Main st = 123 Main street (abbrevation)
Bob Smith = Robert Smith (synonym)
Foi útil?

Solução 3

Found a partial solution using simhash algorithm. Found a good example here http://simhash.codeplex.com/

Outras dicas

This whole area of research is generally known as record linkage (ironically, it has about a dozen duplicate names). There are quite a few tools out there that will let you configure matching for your specific data, churn through the data, and output the duplicates for you. Some tools will even create a matching for you, if you answer some questions about the correctness of potential matches.

Q/N-gram comparison (and indexing) is one possible way to solve this, but there is a whole raft of others. You'll quickly find that different types of comparators work well for different types of data. I haven't tried Q-gram indexing myself, but researchers in the field have described it to me as relatively slow.

As for comparison with phonetic key functions (like Soundex or Metaphone or whatever) this is only suitable when you have small text fields, like separate fields for given name, family name, middle name etc. Even then these functions tend to be rather coarse. And beware of Soundex. It not only is extremely coarse, turning very different names into the same key, but also misses a number of similar names that should be treated the same. Metaphone is much better.

The Wikipedia page for record linkage used to have a list of tools, but the editors removed it. I wrote an open source tool called Duke to solve this kind of thing. It has a genetic algorithm that can help you create a configuration, or you can write one manually. Other tools exist, too.

I would advise you to use one of the tools that already exist rather than attempting to solve this from scratch.

I believe the best option for deduping names is by means of a phonetic encoder. A phonetic encoder will be able to dedup alternative spellings of the same name, here is an example with some common names:

Group: Kathryn names: [Kathryn, Katharine, Katherin, Katherynn, Kathrynn, Katherynne, Kathrynne, Catherine, Cathryn, Catharine, Catherin, Catherynn, Cathrynn, Catherynne, Cathrynne]
Group: Assaf names: [Assaf, Asaf]
Group: Megan names: [Megan, Meagan, Meghan, Meaghan]
Group: Allison names: [Allison, Alyson, Allyson, Alison, Allisyn]
==============================================================
Phonetic Encoder: Caverphone2
---- Names Group: Kathryn ----
Encoded names: {KTRN111111=16}
---- Names Group: Assaf ----
Encoded names: {ASF1111111=3}
---- Names Group: Megan ----
Encoded names: {MKN1111111=5}
---- Names Group: Allison ----
Encoded names: {ALSN111111=6}
==============================================================
Phonetic Encoder: DoubleMetaphone
---- Names Group: Kathryn ----
Encoded names: {K0RN=16}
---- Names Group: Assaf ----
Encoded names: {ASF=3}
---- Names Group: Megan ----
Encoded names: {MKN=5}
---- Names Group: Allison ----
Encoded names: {ALSN=6}
==============================================================
Phonetic Encoder: Nysiis
---- Names Group: Kathryn ----
Encoded names: {CATRYN=7, CATARA=6, CATARY=5}
---- Names Group: Assaf ----
Encoded names: {ASAF=3}
---- Names Group: Megan ----
Encoded names: {MAGAN=5}
---- Names Group: Allison ----
Encoded names: {ALASAN=3, ALYSAN=3, ALASYN=2}
==============================================================
Phonetic Encoder: Soundex
---- Names Group: Kathryn ----
Encoded names: {K365=8, C365=9}
---- Names Group: Assaf ----
Encoded names: {A210=3}
---- Names Group: Megan ----
Encoded names: {M250=5}
---- Names Group: Allison ----
Encoded names: {A425=6}
==============================================================
Phonetic Encoder: RefinedSoundex
---- Names Group: Kathryn ----
Encoded names: {C30609080=5, K3060908=5, K30609080=4, C3060908=5}
---- Names Group: Assaf ----
Encoded names: {A0302=3}
---- Names Group: Megan ----
Encoded names: {M80408=5}
---- Names Group: Allison ----
Encoded names: {A070308=6}
==============================================================

In the example you can see that for Caverphone and DoubleMetaphone all names were encoded to the same string. you should see what makes sense for your data, the encoder to use depends on the language and etymology of the names (i.e. english names, german names...)

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top