Question

Here's an interesting problem.

I have an oracle database with name & address information which needs to be kept current.

We get data feeds from a number of different gov't sources, and need to figure out matches, and whether or not to update the db with the data, or if a new record needs to be created.

There isn't any sort of unique identifier that can be used to tie records together, and the data quality isn't always that good - there will always be typos, people using different names (i.e. Joe vs. Joseph), etc.

I'd be interested in hearing from anyone who's worked on this type of problem before as to how they solved it, or at least automated parts of it.

Solution

Each of the major software companies active in this space offers a solution suite that handles name and address parsing, data standardization, record deduplication or matching, record linking/merging, survivorship, and so on. They're all a bit pricey, though.

For example, Oracle's own solution for this problem is the product "Oracle Data Quality (ODQ) for Oracle Data Integrator (ODI)," which is part of their Fusion Middleware stack. As the name implies, ODQ requires ODI (i.e., it is an add-on module that is licensed separately and is dependent on ODI).

IBM's Websphere solution suite (obtained through their Ascential acquisition) includes QualityStage.

Business Objects, now an SAP company, has a Data Quality product under its Enterprise Information Management (EIM) suite.

Other major data quality brands include Dataflux (a SAS company) and Trillium Software (a Harte-Hanks company).

The Gartner Group releases an annual Magic Quadrant for data quality solution suites. Vendors who rate well in these Magic Quadrants usually make the entire report available online to registered users on their website (example 1, example 2).

OTHER TIPS

On the open source front, there is now a Python library for doing this kind of record deduplication / entity resolution: Dedupe.

It requires programming skills, but it's free and can be run on a laptop, as opposed to a huge server.

Here's an overview of how it works.
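
For a rough idea of what using it looks like, here is a minimal sketch against a tiny in-memory dataset. It assumes the dedupe 2.x Python API (Dedupe, prepare_training, console_label, train, partition); the field names and records are invented for illustration, the variable-definition syntax has changed between library versions, and in practice dedupe wants far more records and labeled pairs than this, so treat it as a shape of the workflow rather than a working recipe.

    # Minimal sketch of the dedupe workflow (assumes the dedupe 2.x API).
    import dedupe

    # Records keyed by an arbitrary id; values are plain dicts of fields.
    data = {
        1: {"name": "Joseph Smith",   "address": "123 Main St"},
        2: {"name": "Joe Smith",      "address": "123 Main Street"},
        3: {"name": "Ann DiGiovanni", "address": "9 Elm Ave"},
    }

    # Tell dedupe which fields to compare and how (syntax varies by version).
    fields = [
        {"field": "name", "type": "String"},
        {"field": "address", "type": "String"},
    ]

    deduper = dedupe.Dedupe(fields)
    deduper.prepare_training(data)

    # Interactive step: you label a handful of candidate pairs as
    # duplicate / not duplicate at the console, and dedupe learns weights.
    dedupe.console_label(deduper)
    deduper.train()

    # Cluster the records; the threshold trades precision against recall.
    clusters = deduper.partition(data, threshold=0.5)
    for record_ids, scores in clusters:
        print(record_ids, scores)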

Lessons Learned from using FirstLogic. (Other products should be similar.) For context, we were getting files from dozens of sources, and each would be in a different format. We needed to know who needed to be added to our database, who needed to be updated, and who just needed to be marked as "on the list".

  1. I expected the software would look at an entire file and determine that--say--if column A had last names in 98 out of 100 rows then column A must be the last name column. That's not true; each record is processed individually.

  2. If you know exactly what data is in what field, then you can tell the software that and it will handle it. But if you're not always sure, you're better off just giving it the whole string and letting the software figure it out. An example of this is a last name field that contains something like "Smith Jr MD". If you declare it to be the last name column, it will assume the last name is "Smith Jr MD". But if you pass "John Smith Jr MD" and let the software figure it out, it will correctly identify all the bits.

  3. Some things that seemed obvious were not. For example, out of the box FirstLogic doesn't assume that a comma means "last name, first name". It actually has a great big list of last names and a great big list of first names, so it figures it can just ignore the comma. We consistently had issues with people like "John, Thomas". And sometimes it would get confused because what looked like an obvious last name to us was actually a first name in Spanish or something.

  4. It's hard to use the system to add dummy data. If you have people named things like "Test Account" or "TBD" they won't be processed at all--FirstLogic will throw the record away because it didn't find any bits of data it recognized.

  5. Customizing the system is possible, but not as easy as the sales people make it sound. There are a million options and custom files and dictionaries. You'll need to invest the time to understand how it works if you expect to be able to customize it. It's like an RDBMS or an ETL tool. It's not a magic black box.

  6. It also has a ton of data quality features that help to justify buying the software but that require a dedicated effort to learn and apply.

  7. The tools aren't really designed to process a list against a master file; they were created for merge-purge operations. It's possible (we're doing it), but it requires that you treat your database as a list (which requires you to extract all the live records into a flat file). Also, the more control you want over which records go where (i.e. if the name is an 80% match and the first two digits of the zip code are identical, then it's probably the same person, etc.), the more complicated your batch process is going to be (see the sketch after this list).

  8. Finally, we realized that processing "John Smith" and processing "Annabelle DiGiovanni" are very different beasts when you're trying to determine if two people at different addresses are actually the same. You can fine-tune the software so that uncommon names are given more weight on a match than common names, but who has the time? Our matching process was about 80% accurate, and we were satisfied with manually processing the remaining 20%.
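
To make point 7 concrete, a rule like "the name is an 80% match and the first two digits of the zip code are identical" is easy to express outside the tool once you've extracted records to a flat file. This is a rough standard-library Python sketch; the field names and the 0.8 cutoff are placeholders for whatever your own data warrants.

    from difflib import SequenceMatcher

    def name_similarity(a, b):
        """Crude 0..1 similarity between two full-name strings."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def probably_same_person(rec_a, rec_b):
        """Example rule: name at least 80% similar AND the first two
        digits of the zip code are identical."""
        return (name_similarity(rec_a["name"], rec_b["name"]) >= 0.8
                and rec_a["zip"][:2] == rec_b["zip"][:2])

    a = {"name": "John Smith", "zip": "60614"}
    b = {"name": "Jon Smith",  "zip": "60610"}
    print(probably_same_person(a, b))  # True: near-identical name, same zip prefix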

Don't think of it as a tool to automatically do the matching. Think of it as a tool to allow your human data processors to be more productive. That way you've set it up so that if you hit 80% accuracy it's a resounding success rather than short of the mark.

At my current job, we have a lot of data integrity issues. We try to "scrub" data before it is loaded to ensure quality. We currently use Melissa Data to scrub names, addresses and emails. It does a pretty good job standardizing things before we load it. That way, we can avoid duplicate data.

Also, SQL Server Integration Services (I know you are using Oracle) has a component which does "fuzzy" matching on strings, allowing you to find a near match rather than a straight one-to-one match. It is called Fuzzy Lookup. Basically it gives you a score telling you how close to a match the two inputs are... You can then set thresholds to tell it how close it has to be to consider it a match (i.e. within 80% or whatever).

In the past, I have also used SOUNDEX to determine if names were similar when looking for duplicates. There are lots of SOUNDEX implementations.
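
If you want to experiment without relying on the database, the classic algorithm fits in a few lines. This is a plain-Python version of a common simplified variant of American Soundex (Oracle and SQL Server both also expose SOUNDEX as a built-in SQL function).

    def soundex(name):
        """Simplified American Soundex: keep the first letter, map the rest
        to digits, drop vowels/H/W/Y, collapse adjacent duplicate codes,
        and pad/truncate to four characters."""
        codes = {c: str(d) for d, letters in enumerate(
            ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}
        name = "".join(c for c in name.upper() if c.isalpha())
        if not name:
            return ""
        digits = [codes.get(c, "0") for c in name]   # "0" for vowels, H, W, Y
        out = [name[0]]
        for prev, cur in zip(digits, digits[1:]):
            if cur != "0" and cur != prev:
                out.append(cur)
        return ("".join(out) + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- treated as the same
    print(soundex("Joseph"), soundex("Joe"))     # J210 J000 -- Joe vs. Joseph still differs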

Good luck.

The term for the problem you're looking to solve is "Record Linking".

I cannot recommend a specific tool. Do a search for "record linking software" or "merge purge software" and with a little research you should be able to find something that meets your needs.

I was able to find one open source solution named Febrl. This tool was originally developed for the realm of biomedical research, which explains the name "Freely extensible biomedical record linkage." You can read more about it here: "Parallel Large Scale Techniques for High-Performance Record Linkage".

For a good overview of the problems/solutions involved in record linking read about "Merge/Purge and Duplicate Detection".

Each major provider does have its own solution: Oracle, IBM, SAS DataFlux, etc., and each claims to be the best.

Independent verified evaluation:

There was a study done at the Curtin University Centre for Data Linkage in Australia that simulated the matching of 4.4 million records. It identified how the providers compared in terms of accuracy (number of matches found vs. matches available, and number of false matches):

  1. DataMatch Enterprise: highest accuracy (>95%), very fast, low cost
  2. IBM QualityStage: high accuracy (>90%), very fast, high cost (>$100K)
  3. SAS DataFlux: medium accuracy (>85%), fast, high cost (>$100K)

That was the best independent evaluation we could find, and it was very thorough.

I did something like this on membership enrollment lists for an insurance company. Luckily we had SSNs to determine the primary member, but I had to decide if dependents were new or existing (by name alone). I tried a soundex algorithm as part of a solution, but it didn't seem to fit the bill, because it seemed to map too many different names to the same thing (and many families tend to name their kids with similar-sounding names). I ended up just using (I think) the first four letters of the dependent's first name as a unique identifier and calling it "good enough." I don't know how I'd handle George Foreman's family :-)

I have seen Dataflux in action, and, according to our "star" data analyst, it is the most accurate tool for "clustering" (as Dataflux calls it) that he has found for linking names of people and companies. It is pretty expensive, though.

It's a classic problem of record linkage. Typos, missing letters, etc. can be handled with a probabilistic approach by defining a set of rules for blocking.
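
As a rough illustration of the blocking half of that (independent of any particular tool), you group records by a cheap key and only compare records that share a block, which keeps the number of pairwise comparisons manageable. The blocking key below (last-name initial plus zip prefix) is just an example; real systems typically use several keys, including phonetic ones.

    from collections import defaultdict
    from itertools import combinations

    records = [
        {"id": 1, "last": "Smith",  "first": "Joseph", "zip": "60614"},
        {"id": 2, "last": "Smyth",  "first": "Joe",    "zip": "60610"},
        {"id": 3, "last": "Garcia", "first": "Maria",  "zip": "33101"},
    ]

    # Blocking: only records sharing a key are ever compared pairwise.
    blocks = defaultdict(list)
    for rec in records:
        key = (rec["last"][:1].upper(), rec["zip"][:3])  # example blocking key
        blocks[key].append(rec)

    candidate_pairs = [pair for group in blocks.values()
                       for pair in combinations(group, 2)]
    # Smith/Smyth share a block and get compared; Garcia is never compared.
    print([(a["id"], b["id"]) for a, b in candidate_pairs])  # [(1, 2)]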

One piece of software that could do it nicely for you is FRIL. It's written in Java, free, and very friendly. There are links to tutorials and videos on the website. It supports DB connectivity, although I'm not sure about Oracle.

RecLink could be another way to go, although you would probably have to work harder to get it up and running since there are no manuals/tutorials (to my knowledge).

Among free options, there is also RELAIS.

I used dataladder.com for a big dataset of names. They did a fantastic job matching up different databases with very different spellings and cut through a lot of duplicates.

Licensed under: CC-BY-SA with attribution