Question
We have a SQL Server table containing Company Name, Address, and Contact name (among others).
We regularly receive data files from outside sources that require us to match up against this table. Unfortunately, the data is slightly different since it is coming from a completely different system. For example, we have "123 E. Main St." and we receive "123 East Main Street". Another example, we have "Acme, LLC" and the file contains "Acme Inc.". Another is, we have "Ed Smith" and they have "Edward Smith"
We have a legacy system that uses some rather intricate, CPU-intensive methods to handle these matches. Some involve pure SQL and others involve VBA code in an Access database. The current system is good but not perfect, and it is cumbersome and difficult to maintain.
Management wants to expand its use. The developers who will inherit support of the system want to replace it with a more agile solution that requires less maintenance.
Is there a commonly accepted way for dealing with this kind of data matching?
Solution
Here's something I wrote for a nearly identical problem (we needed to standardize manufacturer names for hardware, and there were all sorts of variations). This is client side though (VB.Net, to be exact) and uses the Levenshtein distance algorithm (modified for better results):
Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
    Dim bestMatch As String = ""
    Dim bestDistance As Integer = 1000 'Almost anything should be better than that!
    For Each matchCandidate As String In stringList
        Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
        If candidateDistance < bestDistance Then
            bestMatch = matchCandidate
            bestDistance = candidateDistance
        End If
    Next
    Return bestMatch
End Function
'This will be used to determine how similar strings are. Modified from the link below...
'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list
Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
    Dim sLength As Integer = s.Length ' length of s
    Dim tLength As Integer = t.Length ' length of t
    Dim lvCost As Integer ' cost
    Dim lvDistance As Integer = 0
    Dim zeroCostCount As Integer = 0
    Try
        ' Step 1: an empty string is all insertions
        If tLength = 0 Then
            Return sLength
        ElseIf sLength = 0 Then
            Return tLength
        End If

        ' The (sLength+1) x (tLength+1) distance matrix, stored as a flat buffer
        Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
        Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}

        ' fill first row
        For lvIndex As Integer = 0 To sLength
            poBuffer(lvIndex) = lvIndex
        Next

        ' fill first column
        For lvIndex As Integer = 1 To tLength
            poBuffer(lvIndex * (sLength + 1)) = lvIndex
        Next

        For lvRowIndex As Integer = 0 To sLength - 1
            Dim s_i As Char = s(lvRowIndex)
            For lvColIndex As Integer = 0 To tLength - 1
                If s_i = t(lvColIndex) Then
                    lvCost = 0
                    zeroCostCount += 1
                Else
                    lvCost = 1
                End If
                ' Step 6: each cell is the min of delete, insert, or substitute
                Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
            Next
        Next
    Catch ex As ThreadAbortException
        Err.Clear()
    Catch ex As Exception
        WriteDebugMessage(Application.StartupPath, [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
    End Try
    ' The modification: subtract the count of matching characters so that
    ' candidates sharing more characters with the target score better.
    Return lvDistance - zeroCostCount
End Function
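The same idea ports readily to other languages. Here is an illustrative Python sketch of the classic (unmodified) edit distance plus the best-match loop; `levenshtein` and `best_match` are names I've chosen here, not part of the original code:

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    prev = list(range(len(t) + 1))  # distances for the previous row
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(to_find: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest edit distance to to_find."""
    return min(candidates, key=lambda c: levenshtein(to_find, c))
```

For example, `best_match("Ed Smith", ["Edward Smith", "John Doe"])` returns `"Edward Smith"`, since inserting "ward" costs only 4 edits.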
OTHER TIPS
SSIS (in SQL Server 2005+ Enterprise) has a Fuzzy Lookup transformation, which is designed for just this sort of data-cleansing issue.
Other than that, I only know of domain specific solutions - such as address cleaning, or general string matching techniques.
There are many vendors out there that offer products to do this kind of pattern matching. I would do some research and find a good, well-reputed product and scrap the home-grown system.
As you say, your product is only good, and this is a common-enough need for businesses that I'm sure there's more than one excellent product out there. Even if it costs a few thousand bucks for a license, it will still be cheaper than paying a bunch of developers to work on something in-house.
Also, the fact that the phrases "intricate", "CPU intensive", "VBA code" and "Access database" appear together in your system's description is another reason to find a good third-party tool.
EDIT: it's also possible that .NET has a built-in component that does this kind of thing, in which case you wouldn't have to pay for it. I still get surprised once in a while by the tools that .NET offers.
I'm dealing with exactly the same problem. Take a look at:
Tools for matching name/address data
for some tools that might help.
Access doesn't really have the tools for this. In an ideal world I would go with the SSIS solution and use fuzzy lookup. But if you are currently using Access, the chances of your office buying SQL Server Enterprise edition seem low to me. If you are stuck with the current environment, you could try a brute force approach.
Start with standardized cleansing of addresses. Pick standard abbreviations for Street, Road, etc., and write code to change all the normal variations to those standard abbreviations. Replace any instances of two spaces with one space, trim all the data, and remove any non-alphanumeric characters. As you can see, this is quite a task.
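A minimal sketch of that cleansing step in Python; the abbreviation map here is a made-up sample, not a complete list:

```python
import re

# Sample abbreviation map -- a real one would cover far more variants.
ABBREVIATIONS = {
    "street": "st",
    "road": "rd",
    "avenue": "ave",
    "east": "e",
    "west": "w",
}

def normalize_address(addr: str) -> str:
    """Lowercase, drop non-alphanumerics, collapse spaces, standardize words."""
    addr = addr.lower()
    addr = re.sub(r"[^a-z0-9\s]", " ", addr)  # remove punctuation
    words = [ABBREVIATIONS.get(w, w) for w in addr.split()]  # split() collapses spaces
    return " ".join(words)
```

With this, both "123 E. Main St." and "123 East Main Street" normalize to "123 e main st", so they compare equal.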
As for company names, maybe you can try matching on the first 5 characters of the name plus the address or phone. You could also create a table of known variations and the record they relate to in your database, to use for cleansing future files. So if your record with id 100 is Acme, Inc., you could have a table like this:
idfield   Name
100       Acme, Inc.
100       Acme, Inc
100       Acme, Incorporated
100       Acme, LLC
100       Acme
This will start small but build over time if you make an entry every time you find and fix a duplicate (make it part of your de-duping process), and every time you are able to match the first part of the name and address to an existing company.
I'd also look at that function Torial posted and see if it helps.
All of this would be painful and time-consuming, but it would get better over time as you find new variations and add them to the code or the list. If you do decide to standardize your address data, make sure to clean the production data first, then import new files to a work table and clean them, and only then match against production data and insert new records.
I just found this link that is related.
I swear that I looked before I posted this.
There are quite a few ways to tackle this that may not be obvious. The best is finding unique identifiers that you can use for matching outside of the fields with misspellings, etc.
Some thoughts
- The obvious, Social security number, drivers license, etc
- Email address
- Cleansed phone number (remove punctuation, etc.)
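Phone cleansing in particular is cheap to do. A small illustrative sketch, assuming US-style 10-digit numbers with an optional leading country code:

```python
import re

def cleanse_phone(raw: str) -> str:
    """Strip everything but digits, and drop a leading US country code."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assume the leading 1 is the US country code
    return digits
```

This makes "(555) 123-4567", "555.123.4567", and "1-555-123-4567" all compare equal, giving you a reliable join key.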
As far as vendors go I just answered a similar question and am pasting below.
Each major provider has its own solution. Oracle, IBM, SAS DataFlux, etc., and each claims to be the best at this kind of problem.
Independent verified evaluation:
There was a study done at the Curtin University Centre for Data Linkage in Australia that simulated the matching of 4.4 million records. It identified what the providers achieved in terms of accuracy (number of matches found vs. available, and number of false matches):
- DataMatch Enterprise: highest accuracy (>95%), very fast, low cost
- IBM QualityStage: high accuracy (>90%), very fast, high cost (>$100K)
- SAS DataFlux: medium accuracy (>85%), fast, high cost (>$100K)
That was the best independent evaluation we could find; it was very thorough.