Data Architecture: Deduplication of product catalogs
02-07-2021
Question
I'm thinking through my strategy for merging (and de-duplicating) multiple catalogs of products.
I'll be using a NoSQL database, and I need to query N catalogs of partially overlapping products.
Certain aspects such as categorization, tags, and descriptions need to be normalized, and I need to track which catalogs contain each unique item (de-duplication of products across catalogs, by UPC for example).
My current thought is to import the individual catalogs into their own tables, use self-built algorithms to identify "similar" items, perform normalization, and then create a final "master" table that contains the normalized and de-duplicated data. The master-record values would be copied from whichever catalog (or mix of catalogs) they're chosen from, and each master record would link back to the catalogs that contain that item.
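To make that concrete, this is roughly the shape I have in mind for a master record (a sketch only; the field names are placeholders, not a final schema):

```java
import java.util.List;
import java.util.Map;

// Hypothetical master-record shape: normalized fields plus links back to the
// source catalogs that contain this item (keyed by UPC after de-duplication).
public class MasterProduct {
    public final String upc;                  // normalized UPC used as the de-duplication key
    public final String title;                // copied from whichever source catalog was chosen
    public final String category;             // normalized category
    public final List<String> tags;           // normalized tags
    public final Map<String, String> sources; // catalog name -> that catalog's own product id

    public MasterProduct(String upc, String title, String category,
                         List<String> tags, Map<String, String> sources) {
        this.upc = upc;
        this.title = title;
        this.category = category;
        this.tags = tags;
        this.sources = sources;
    }
}
```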
I'd like to hear what other thoughts exist on the subject. What areas of research should I look into to better educate myself?
Solution
You didn't supply a lot of details, but from what I understand, if you were using HBase you could do the following:
1. Write all the data into HBase in its original format, or close to it.
2. Write a MapReduce job to sort things out (see the sketch after this list):
   2.1. In the map phase, normalize each record and emit the potential keys.
   2.2. In the reduce phase (where you get all the records with the same key), produce the master record.
3. Export the master records to wherever you'd like.
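A minimal sketch of steps 2 and 3, assuming Hadoop MapReduce over an HBase table; the table name "raw_catalogs", the "info" column family with "upc"/"title"/"catalog" qualifiers, the normalization rule, and the merge rule are all placeholders for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CatalogDedup {

    static final byte[] CF = Bytes.toBytes("info"); // hypothetical column family

    // Map phase: normalize each raw record and emit its candidate key (here, the UPC).
    public static class NormalizeMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            String upc = Bytes.toString(value.getValue(CF, Bytes.toBytes("upc")));
            String title = Bytes.toString(value.getValue(CF, Bytes.toBytes("title")));
            String catalog = Bytes.toString(value.getValue(CF, Bytes.toBytes("catalog")));
            if (upc == null || catalog == null) {
                return; // records without a key can't be de-duplicated this way
            }
            String normalizedKey = upc.replaceAll("[^0-9]", ""); // toy normalization
            context.write(new Text(normalizedKey),
                          new Text(catalog + "\t" + (title == null ? "" : title)));
        }
    }

    // Reduce phase: all records sharing a normalized key arrive together;
    // build the master record and remember which catalogs contained the item.
    public static class MasterRecordReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String masterTitle = "";
            StringBuilder catalogs = new StringBuilder();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if (catalogs.length() > 0) {
                    catalogs.append(",");
                }
                catalogs.append(parts[0]);
                String title = parts.length > 1 ? parts[1] : "";
                if (title.length() > masterTitle.length()) {
                    masterTitle = title; // naive merge rule: keep the longest title
                }
            }
            context.write(key, new Text(masterTitle + "\t" + catalogs));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "catalog-dedup");
        job.setJarByClass(CatalogDedup.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // read in larger batches for the full-table scan
        scan.setCacheBlocks(false); // don't pollute the block cache during the scan

        // "raw_catalogs" is a placeholder table holding the imported catalogs.
        TableMapReduceUtil.initTableMapperJob("raw_catalogs", scan,
                NormalizeMapper.class, Text.class, Text.class, job);
        job.setReducerClass(MasterRecordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0])); // step 3: export destination

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the key isn't as clean as a UPC, the map phase is also where you would emit several potential keys per record instead of just one.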
Other suggestions
There are some local companies here that generate SQL schemas nightly or weekly from NoSQL stores for reporting purposes.
From what I understand, the approach they use is exactly the one you have described. Unless your data set is very large, you shouldn't have any issues with that strategy.
This has been a huge area of research since the 1940s (no, honestly) under the name record linkage, though unfortunately it also goes by many other names, like "identity resolution", "data matching", "merge/purge", and so on. There's a huge amount to learn here, and people have developed lots of techniques and tools that you can use. I would strongly recommend that you familiarize yourself with these before trying to write something yourself.
Note that a key problem is going to be performance. You basically have to compare all record pairs (which is O(n^2)), and you should ideally use fuzzy string comparators (which are all slow). That alone is a good reason for using a tool that already has the performance problem solved, and can also provide the string comparators etc.
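To illustrate why this blows up, here is a minimal, self-contained sketch of the naive approach (the sample titles and the 0.6 threshold are made up, and Levenshtein stands in for whatever fuzzy comparator you pick): every pair of records is compared, so the work grows as n(n-1)/2, and each comparison is itself quadratic in the string lengths.

```java
import java.util.List;

public class NaivePairwiseMatch {

    // Plain dynamic-programming Levenshtein distance: O(a.length() * b.length()) per call.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Naive matching: every record is compared with every other record,
    // i.e. n * (n - 1) / 2 comparisons -- O(n^2) in the number of records.
    static void findCandidatePairs(List<String> titles, double threshold) {
        for (int i = 0; i < titles.size(); i++) {
            for (int j = i + 1; j < titles.size(); j++) {
                String a = titles.get(i), b = titles.get(j);
                int maxLen = Math.max(a.length(), b.length());
                double similarity = maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
                if (similarity >= threshold) {
                    System.out.println("Candidate match: \"" + a + "\" ~ \"" + b + "\"");
                }
            }
        }
    }

    public static void main(String[] args) {
        findCandidatePairs(List.of("acme widget 500ml", "acme widget 0.5l", "other gadget"), 0.6);
    }
}
```

Existing record linkage tools avoid exactly this full pairwise comparison, which is a big part of why they are worth using instead of rolling your own.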
The Wikipedia article on record linkage has references to both research and tools; I strongly recommend looking at it.
Anyway, if you want to learn more, the first book (that I know of) on the subject was published earlier this year: Data Matching, by Peter Christen. Two good overview papers are Duplicate Record Detection: A Survey (Elmagarmid, Ipeirotis, and Verykios), and Overview of Record Linkage and Current Research Directions (William Winkler). I'd post links, but the anti-spam won't let me. I did a presentation on this earlier this year that gives a brief overview of the problems, research, and solutions (it's on slideshare, title "Linking data without common identifiers").