Data Architecture: Deduplication of product catalogs
02-07-2021
Question
I'm thinking through my strategy for merging (and de-duplicating) multiple catalogs of products.
I'll be using a NoSQL database, and I need to query N catalogs of partially overlapping products.
Certain aspects such as categorization, tags, and descriptions need to be normalized, and I need to track which catalogs contain each unique item (de-duplication of products across catalogs, by UPC for example).
My current thought is to import the individual catalogs into their own tables, use self-built algorithms to identify "similar" items, perform normalization, and then create a final "master" table that contains the normalized and de-duplicated data. The master-record values would be copied from whichever catalog (or mix of catalogs) they're chosen from, and each master record would link back to the catalogs that contain that item.
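To make that concrete, this is roughly the shape I have in mind for a master record (a sketch only; the field names are placeholders, not a final schema):

```java
import java.util.List;
import java.util.Map;

// Hypothetical master-record shape: normalized fields plus links back to the
// source catalogs that contain this item (keyed by UPC after de-duplication).
public class MasterProduct {
    public final String upc;                  // normalized UPC used as the de-duplication key
    public final String title;                // copied from whichever source catalog was chosen
    public final String category;             // normalized category
    public final List<String> tags;           // normalized tags
    public final Map<String, String> sources; // catalog name -> that catalog's own product id

    public MasterProduct(String upc, String title, String category,
                         List<String> tags, Map<String, String> sources) {
        this.upc = upc;
        this.title = title;
        this.category = category;
        this.tags = tags;
        this.sources = sources;
    }
}
```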
I'd like to hear what other thoughts exist on the subject. What areas of research should I look into to better educate myself?
Solution
You didn't supply a lot of details, but from what I understand, if you were using HBase you could do the following:
1. Write all the data into HBase in its original format, or close to it.
2. Write a MapReduce job to sort things out (see the sketch after this list):
   2.1. In the map phase, normalize each record and emit the potential keys.
   2.2. In the reduce phase (where you get all the records with the same key), produce the master record.
3. Export the master records to wherever you'd like.
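A minimal sketch of steps 2 and 3, assuming Hadoop MapReduce over an HBase table; the table name "raw_catalogs", the "info" column family with "upc"/"title"/"catalog" qualifiers, the normalization rule, and the merge rule are all placeholders for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CatalogDedup {

    static final byte[] CF = Bytes.toBytes("info"); // hypothetical column family

    // Map phase: normalize each raw record and emit its candidate key (here, the UPC).
    public static class NormalizeMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            String upc = Bytes.toString(value.getValue(CF, Bytes.toBytes("upc")));
            String title = Bytes.toString(value.getValue(CF, Bytes.toBytes("title")));
            String catalog = Bytes.toString(value.getValue(CF, Bytes.toBytes("catalog")));
            if (upc == null || catalog == null) {
                return; // records without a key can't be de-duplicated this way
            }
            String normalizedKey = upc.replaceAll("[^0-9]", ""); // toy normalization
            context.write(new Text(normalizedKey),
                          new Text(catalog + "\t" + (title == null ? "" : title)));
        }
    }

    // Reduce phase: all records sharing a normalized key arrive together;
    // build the master record and remember which catalogs contained the item.
    public static class MasterRecordReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String masterTitle = "";
            StringBuilder catalogs = new StringBuilder();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if (catalogs.length() > 0) {
                    catalogs.append(",");
                }
                catalogs.append(parts[0]);
                String title = parts.length > 1 ? parts[1] : "";
                if (title.length() > masterTitle.length()) {
                    masterTitle = title; // naive merge rule: keep the longest title
                }
            }
            context.write(key, new Text(masterTitle + "\t" + catalogs));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "catalog-dedup");
        job.setJarByClass(CatalogDedup.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // read in larger batches for the full-table scan
        scan.setCacheBlocks(false); // don't pollute the block cache during the scan

        // "raw_catalogs" is a placeholder table holding the imported catalogs.
        TableMapReduceUtil.initTableMapperJob("raw_catalogs", scan,
                NormalizeMapper.class, Text.class, Text.class, job);
        job.setReducerClass(MasterRecordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0])); // step 3: export destination

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the key isn't as clean as a UPC, the map phase is also where you would emit several potential keys per record instead of just one.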
Other suggestions
There are some local companies here that generate SQL schemas nightly or weekly from NoSQL stores for reporting purposes.
From what I understand, the approach they use is exactly the one you have described. Unless your data set is very large, you shouldn't have any issues with that strategy.
This has been a huge area of research since the 1940s (no, honestly) under the name record linkage, though unfortunately it also goes by many other names, like "identity resolution", "data matching", "merge/purge", and so on. There's a huge amount to learn here, and people have developed lots of techniques and tools that you can use. I would strongly recommend that you familiarize yourself with these before trying to write something yourself.
Note that a key problem is going to be performance. You basically have to compare all record pairs (which is O(n^2)), and you should ideally use fuzzy string comparators (which are all slow). That alone is a good reason for using a tool that already has the performance problem solved, and can also provide the string comparators etc.
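To illustrate why this blows up, here is a minimal, self-contained sketch of the naive approach (the sample titles and the 0.6 threshold are made up, and Levenshtein stands in for whatever fuzzy comparator you pick): every pair of records is compared, so the work grows as n(n-1)/2, and each comparison is itself quadratic in the string lengths.

```java
import java.util.List;

public class NaivePairwiseMatch {

    // Plain dynamic-programming Levenshtein distance: O(a.length() * b.length()) per call.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Naive matching: every record is compared with every other record,
    // i.e. n * (n - 1) / 2 comparisons -- O(n^2) in the number of records.
    static void findCandidatePairs(List<String> titles, double threshold) {
        for (int i = 0; i < titles.size(); i++) {
            for (int j = i + 1; j < titles.size(); j++) {
                String a = titles.get(i), b = titles.get(j);
                int maxLen = Math.max(a.length(), b.length());
                double similarity = maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
                if (similarity >= threshold) {
                    System.out.println("Candidate match: \"" + a + "\" ~ \"" + b + "\"");
                }
            }
        }
    }

    public static void main(String[] args) {
        findCandidatePairs(List.of("acme widget 500ml", "acme widget 0.5l", "other gadget"), 0.6);
    }
}
```

Existing record linkage tools avoid exactly this full pairwise comparison, which is a big part of why they are worth using instead of rolling your own.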
The Wikipedia article on record linkage has references to both research and tools; I strongly recommend looking at it.
Anyway, if you want to learn more, the first book (that I know of) on the subject was published earlier this year: Data Matching, by Peter Christen. Two good overview papers are Duplicate Record Detection: A Survey (Elmagarmid, Ipeirotis, and Verykios), and Overview of Record Linkage and Current Research Directions (William Winkler). I'd post links, but the anti-spam won't let me. I did a presentation on this earlier this year that gives a brief overview of the problems, research, and solutions (it's on slideshare, title "Linking data without common identifiers").