How to match Amazon / CJ / Linkshare Products
Question
I need to build a database from the Amazon, Commission Junction and LinkShare APIs and data feeds, and then match identical products so that I can compare their product information. My problem is the matching process. I started by matching products on SKU/UPC/ASIN, but this does not perform well because many products do not contain this information. I did some research, and the most popular techniques I found are:
- Measuring cosine similarity via TF-IDF
- Measuring edit distance / Levenshtein / Jaro-Winkler
In my approach I used cosine similarity and Jaro-Winkler.
How I do the matching:
Step 1 : Preprocessing
Preprocessing to transform strings into a normal form:
- Lowercase
- Filter stop words (new, by, the …)
- Strip whitespace: replace all whitespace occurrences with a single space character
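A minimal sketch of this normalization in Python, for illustration only. The function name `normalize_title` and the contents of `STOP_WORDS` are assumptions; the post names only "new", "by" and "the" as stop words.

```python
# Assumed stop-word list: the post only names these three examples.
STOP_WORDS = {"new", "by", "the"}


def normalize_title(title: str) -> str:
    """Lowercase a title, drop stop words, and collapse whitespace runs."""
    # str.split() with no argument splits on any run of whitespace,
    # so joining with a single space also strips and collapses it.
    tokens = title.lower().split()
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(kept)


print(normalize_title("  Orange  by Hugo Boss,\t3 Ounce "))
```

Punctuation is left untouched here (for example "boss," keeps its comma), because the post does not say whether punctuation is removed.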
Step 2 : Indexing
Index Amazon products in one Solr core [core A] and CJ/Linkshare products in another core [core B]. The goal of indexing is to limit the number of string comparisons (via TF-IDF and Jaro-Winkler).
Step 3 : Matching
- I start by retrieving a product title from core B, run a Solr search in core A with this title, and take the top 30 results.
- I measure TF-IDF similarity between the product I want to match (the query) and the 30 results retrieved by the Solr search. I keep the products with similarity > 80%.
- I sort the tokens of each product alphabetically, then compare the transformed strings with Jaro-Winkler distance and keep the products with similarity > 80% (in effect, a Jaro-Winkler similarity between phrases).
- Here, I tokenize both strings (the query and the product to match) and perform a comparison between tokens.
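The two filtering steps above could be sketched roughly as follows in pure Python. This is an illustration under assumptions, not the actual implementation: the candidate list stands in for the top 30 Solr results, the IDF is a smoothed variant computed only over the query plus its candidates, and the names `tfidf_vectors`, `cosine`, `jaro`, `jaro_winkler` and `match` are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn token lists into {token: tf-idf weight} dicts (smoothed IDF)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in pairs.
    k = out_of_order = 0
    for i in range(len(s1)):
        if hit1[i]:
            while not hit2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the shared prefix (up to 4 characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def match(query, candidates, threshold=0.8):
    """Keep candidates passing both the TF-IDF and the sorted-token filter."""
    docs = [query.split()] + [c.split() for c in candidates]
    vecs = tfidf_vectors(docs)
    q_sorted = " ".join(sorted(docs[0]))
    kept = []
    for cand, toks, vec in zip(candidates, docs[1:], vecs[1:]):
        if cosine(vecs[0], vec) <= threshold:
            continue  # fails the TF-IDF filter
        if jaro_winkler(q_sorted, " ".join(sorted(toks))) > threshold:
            kept.append(cand)
    return kept
```

Titles are expected to be preprocessed already. Both filters use the 80% threshold from the steps above; the final token-by-token comparison is not sketched because the post does not describe how it is scored.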
But these techniques don't perform well either. Example:
Product 1 : Orange by Hugo Boss, 3 Ounce Eau de toilette Spray
Product 2 : In Motion Orange By Hugo Boss Eau De Toilette Spray 3 Ounces
Products 1 and 2 are judged similar by these techniques, but they are actually different products.
How can I improve this algorithm? Is this the right way to match products? What if I train a classifier on token weights (using Jaro-Winkler), with training data taken from products already matched via UPC, and use this classifier to match products in a final step?
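To make the classifier idea concrete, here is a rough sketch of how labelled training pairs might be derived from UPC matches. Everything in it is an assumption: the dict keys `title` and `upc`, the function names, and the features themselves, which are simple token-overlap counts swapped in for illustration rather than the Jaro-Winkler token weights mentioned above. Training the classifier itself is left out.

```python
def labelled_pairs(products_a, products_b):
    """Build (title_a, title_b, label) triples from products that carry a UPC.

    label is 1 when the UPCs agree and 0 otherwise. Products without a UPC
    are skipped because their true match status is unknown.
    """
    pairs = []
    for a in products_a:
        for b in products_b:
            if a.get("upc") and b.get("upc"):
                pairs.append((a["title"], b["title"], int(a["upc"] == b["upc"])))
    return pairs


def pair_features(title_a, title_b):
    """Token-overlap features for one pair of titles (hypothetical feature set)."""
    toks_a = set(title_a.lower().split())
    toks_b = set(title_b.lower().split())
    union = toks_a | toks_b
    return {
        "jaccard": len(toks_a & toks_b) / len(union) if union else 0.0,
        "only_in_a": len(toks_a - toks_b),  # tokens missing from title_b
        "only_in_b": len(toks_b - toks_a),  # tokens missing from title_a
    }
```

Counts of unshared tokens are included because in the example above the extra tokens "In Motion" are what separate the two products. Pairing every product with every other is quadratic, so in practice the pairs would presumably be limited to the candidates returned by the Solr search.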
PS : I have products from different categories (health, beauty, electronics, books, movies...) and the data is very unstructured or incomplete.
Any advice would be helpful.
Thanks
Smail
No accepted answer