Data Deduplication algorithm for large number of contacts

Question 1

Found a partial solution using simhash algorithm. Found a good example here http://simhash.codeplex.com/

Question 2

This whole area of research is generally known as record linkage (ironically, it has about a dozen duplicate names). There are quite a few tools out there that will let you configure matching for your specific data, churn through the data, and output the duplicates for you. Some tools will even create a matching for you, if you answer some questions about the correctness of potential matches.

Q/N-gram comparison (and indexing) is one possible way to solve this, but there is a whole raft of others. You'll quickly find that different types of comparators work well for different types of data. I haven't tried Q-gram indexing myself, but researchers in the field have described it to me as relatively slow.

As for comparison with phonetic key functions (like Soundex or Metaphone or whatever) this is only suitable when you have small text fields, like separate fields for given name, family name, middle name etc. Even then these functions tend to be rather coarse. And beware of Soundex. It not only is extremely coarse, turning very different names into the same key, but also misses a number of similar names that should be treated the same. Metaphone is much better.

The Wikipedia page for record linkage used to have a list of tools, but the editors removed it. I wrote an open source tool called Duke to solve this kind of thing. It has a genetic algorithm that can help you create a configuration, or you can write one manually. Other tools exist, too.

I would advise you to use one of the tools that already exist rather than attempting to solve this from scratch.

Question 3

I believe the best option for deduping names is by means of a phonetic encoder. A phonetic encoder will be able to dedup alternative spellings of the same name, here is an example with some common names:

Group: Kathryn names: [Kathryn, Katharine, Katherin, Katherynn, Kathrynn, Katherynne, Kathrynne, Catherine, Cathryn, Catharine, Catherin, Catherynn, Cathrynn, Catherynne, Cathrynne]
Group: Assaf names: [Assaf, Asaf]
Group: Megan names: [Megan, Meagan, Meghan, Meaghan]
Group: Allison names: [Allison, Alyson, Allyson, Alison, Allisyn]
==============================================================
Phonetic Encoder: Caverphone2
---- Names Group: Kathryn ----
Encoded names: {KTRN111111=16}
---- Names Group: Assaf ----
Encoded names: {ASF1111111=3}
---- Names Group: Megan ----
Encoded names: {MKN1111111=5}
---- Names Group: Allison ----
Encoded names: {ALSN111111=6}
==============================================================
Phonetic Encoder: DoubleMetaphone
---- Names Group: Kathryn ----
Encoded names: {K0RN=16}
---- Names Group: Assaf ----
Encoded names: {ASF=3}
---- Names Group: Megan ----
Encoded names: {MKN=5}
---- Names Group: Allison ----
Encoded names: {ALSN=6}
==============================================================
Phonetic Encoder: Nysiis
---- Names Group: Kathryn ----
Encoded names: {CATRYN=7, CATARA=6, CATARY=5}
---- Names Group: Assaf ----
Encoded names: {ASAF=3}
---- Names Group: Megan ----
Encoded names: {MAGAN=5}
---- Names Group: Allison ----
Encoded names: {ALASAN=3, ALYSAN=3, ALASYN=2}
==============================================================
Phonetic Encoder: Soundex
---- Names Group: Kathryn ----
Encoded names: {K365=8, C365=9}
---- Names Group: Assaf ----
Encoded names: {A210=3}
---- Names Group: Megan ----
Encoded names: {M250=5}
---- Names Group: Allison ----
Encoded names: {A425=6}
==============================================================
Phonetic Encoder: RefinedSoundex
---- Names Group: Kathryn ----
Encoded names: {C30609080=5, K3060908=5, K30609080=4, C3060908=5}
---- Names Group: Assaf ----
Encoded names: {A0302=3}
---- Names Group: Megan ----
Encoded names: {M80408=5}
---- Names Group: Allison ----
Encoded names: {A070308=6}
==============================================================

In the example you can see that for Caverphone and DoubleMetaphone all names were encoded to the same string. you should see what makes sense for your data, the encoder to use depends on the language and etymology of the names (i.e. english names, german names...)