Question

I need to build a database from the Amazon, Commission Junction, and LinkShare APIs and data feeds, and then match identical products so I can compare their product information. My problem is with the matching process. I started by matching products on SKU/UPC/ASIN, but this does not perform well because many of the products lack these identifiers. I did some research, and the most popular techniques I found are:

-Measuring cosine similarity via TF-IDF

-Measuring edit distance / Levenshtein / Jaro-Winkler

In my approach I used cosine similarity and Jaro-Winkler.

How I do the matching:

Step 1 : Preprocessing

Preprocessing to transform strings into a normal form:

-Lowercase

-Filter stop words (new, by, the …)

-Strip whitespace

-Replace all whitespace occurrences with a single space character
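The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the function name `normalize` is hypothetical, and the stop-word set contains only the three words named in the list (the real list is longer but is not given).

```python
# Assumed stop-word list: only the three examples named in the post.
STOP_WORDS = {"new", "by", "the"}


def normalize(title: str) -> str:
    """Lowercase, drop stop words, and collapse whitespace to single spaces."""
    # str.split() with no argument strips leading/trailing whitespace
    # and treats any run of whitespace as one separator.
    tokens = title.lower().split()
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(kept)
```

For example, `normalize("  The  New Orange   by Hugo Boss ")` returns `"orange hugo boss"`. Punctuation is left untouched here because the list does not mention removing it.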

Step 2 : Indexing

Index Amazon products in one Solr core [core A] and CJ/LinkShare products in another core [core B]. The goal of indexing is to limit the number of string comparisons (via TF-IDF and Jaro-Winkler).

Step 3 : Matching

  1. I start by retrieving a product title from core B, run a Solr search in core A with this title, and take the top 30 results.
  2. I measure the TF-IDF similarity between the product I want to match (the query) and the 30 results retrieved by the Solr search. I keep the products with similarity > 80%.
  3. I sort the tokens of each product title alphabetically, then compare the transformed strings with Jaro-Winkler and keep the products with similarity > 80% (this computes a Jaro-Winkler similarity between whole phrases).
  4. Here, I tokenize both strings (the query and the product to match) and compare them token by token.
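To make steps 2 and 3 concrete, here is a standard-library sketch of TF-IDF cosine similarity and of Jaro-Winkler applied to alphabetically sorted tokens. It is an illustration under assumptions, not the Solr-based code described above: all function names are hypothetical, the IDF is a smoothed variant computed over whatever small set of titles is passed in (for instance the query plus its 30 candidates), and Jaro-Winkler uses the common defaults of a 0.1 prefix scale and a 4-character prefix cap.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per doc."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption) so tokens found in every doc keep some weight.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    # Matched characters that line up out of order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(s_matched, t_matched)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for x, y in zip(s[:4], t[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def sorted_token_string(title):
    """Step 3 transformation: sort whitespace-separated tokens alphabetically."""
    return " ".join(sorted(title.split()))
```

A candidate would then be kept only if both `cosine(...)` and `jaro_winkler(sorted_token_string(query), sorted_token_string(candidate))` exceed 0.8, mirroring the 80% thresholds in the steps.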

But these techniques also don't perform well. Example:

Product 1 : Orange by Hugo Boss, 3 Ounce Eau de toilette Spray

Product 2 : In Motion Orange By Hugo Boss Eau De Toilette Spray 3 Ounces

Products 1 and 2 are similar according to these techniques, but they are actually different products.
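A quick way to see why the two titles score so close is to compare their token sets. The snippet below is only a diagnostic sketch: the tokenizer (lowercase, split on non-alphanumeric characters) and the name `token_set` are assumptions, and no stop words are removed.

```python
import re


def token_set(title):
    # Assumed tokenizer: lowercase, then keep runs of letters and digits.
    return set(re.findall(r"[a-z0-9]+", title.lower()))


p1 = "Orange by Hugo Boss, 3 Ounce Eau de toilette Spray"
p2 = "In Motion Orange By Hugo Boss Eau De Toilette Spray 3 Ounces"

a, b = token_set(p1), token_set(p2)
shared = a & b
only_p1 = a - b
only_p2 = b - a
jaccard = len(shared) / len(a | b)  # share of all distinct tokens in common

print(sorted(only_p1))  # ['ounce']
print(sorted(only_p2))  # ['in', 'motion', 'ounces']
```

Nine of the thirteen distinct tokens are shared, and the only tokens that actually distinguish the products ("in", "motion") carry no more weight than "ounce" versus "ounces", which is why a whole-title similarity score stays high.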

How can I improve this algorithm? Is this the right way to match products? What if I train a classifier on token weights (using Jaro-Winkler), with training data taken from products already matched via UPC, and use this classifier to match products in a final step?

PS : I have products from different categories (health, beauty, electronics, books, movies...) and the data is very unstructured or incomplete.

Any advice would be helpful.

Thanks

Smail

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow