Question

I am looking for deduplication software that is compatible with MS SQL Server. I have a rather extensive and messy table that contains addresses from all over the world in many different languages. The table is set up to handle dupes as parent/child records, so some functionality to handle a match is required (i.e., not just deleting a dupe).

Edit: Here's the structure

ParentID | MasterID | PropertyName | Address1 | Address2 | PostalCode | City | StateProvinceCode | CountryCode | PhoneNumber

The MasterID is unique for each record.

ParentID contains the MasterID for the parent record of each entry, and the parent record is where the MasterID = ParentID.

CountryCode is the two letter ISO country code (not telephone code).


Solution

Address duplicates are notoriously difficult to track down. There are about 10 valid ways to write any one address, which makes exact matching unreliable.

The fact that you have business rules that allow for duplicates some of the time makes me think you might be better off rolling your own piece of software to find unacceptable dupes and remove them.

In the past I have handled this by running each address through a free geocoding service (Google's mapping API, for instance) and looking for points that fall within a certain threshold of each other (about 10 feet, say). You can then decide whether each such pair qualifies as an "unacceptable duplicate" and delete it.

To find distances between coordinates I would recommend finding the Great Circle Distance. Good luck!
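As a rough illustration of the distance check, here is a minimal Python sketch that uses the haversine formula, one common way to compute great-circle distance. It assumes the addresses have already been geocoded to latitude/longitude in degrees. The function names, the `(MasterID, lat, lon)` tuple layout, and the 3-metre default (roughly 10 feet) are my own assumptions for the example, not part of any particular library.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def great_circle_distance_m(lat1, lon1, lat2, lon2):
    """Haversine distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def find_near_duplicates(records, threshold_m=3.0):
    """Return (id_a, id_b) pairs whose points lie within threshold_m metres.

    records is a list of (master_id, lat, lon) tuples (hypothetical layout).
    This is a simple O(n^2) comparison, fine for a sketch but slow on big tables.
    """
    pairs = []
    for i, (id_a, lat_a, lon_a) in enumerate(records):
        for id_b, lat_b, lon_b in records[i + 1:]:
            if great_circle_distance_m(lat_a, lon_a, lat_b, lon_b) <= threshold_m:
                pairs.append((id_a, id_b))
    return pairs
```

Each returned pair is only a candidate match. What happens next (deleting one record, or pointing its ParentID at the other record's MasterID) depends on your business rules. For a large table you would probably want to compare only records that share a CountryCode or PostalCode rather than every pair.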

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow