Question

We've got a list of approximately 18,000 product names (they come from 80-90 sources, so there are quite a few that are similar but not duplicates; these were picked as DISTINCT from a table). Unfortunately there are different ways of expressing these names, and we have to try to normalize the dataset so we present our users with more meaningful names.

For example, a list like this:

Canon EOS 5D Mark III
Canon EOS 5D mk III
Canon EOS 5DMK3
Canon EF 70-200mm f/2.8L IS II USM Lens
Canon EF 70-200mm f/2.8L IS II USM Telephoto Zoom Lens
Canon EF 70-200mm f/2.8L IS USM Lens
Canon EF 70-200mm f/4L USM Lens

I'd like to be able to assess those strings and collapse them into something like this:

Canon EOS 5D Mark III
Canon EF 70-200mm f/2.8L IS II USM 
Canon EF 70-200mm f/2.8L IS USM Lens
Canon EF 70-200mm f/4L USM Lens

But I'd like to know how similar two strings are in order to determine this. I do realise that the f/2.8L IS II and IS USM variants may be a bit hard to tell apart, but I thought I'd throw them in.

The real product names are far less exciting (they're parts for the farm equipment we stock).

We also store these names in a Postgres (9.5) database table. The examples I've seen compare two lists, but unfortunately we don't have a master product list to compare against.


Solution

Your problem is known as near-duplicate document detection, i.e. you have strings that are similar but not exact duplicates. The most common approaches use cosine similarity and Jaccard similarity. You could check this page for more information.
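
As a minimal sketch, Jaccard similarity over word tokens can be computed directly; the tokenisation (lowercased whitespace split) is an illustrative assumption, not something prescribed by the question:

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word-token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two of the product names from the question share 4 of 6 distinct tokens
print(jaccard("Canon EOS 5D Mark III", "Canon EOS 5D mk III"))  # ~0.67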

First you have to convert your strings into feature vectors; these features could be tf-idf weights over all the tokens (words) that appear in the database, or a vector of n-grams. For a discussion of semantic and syntactic features you can look here.
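
A sketch of this feature-vector step using scikit-learn, here with character n-grams and cosine similarity; the analyzer settings and the tiny name list are assumptions for illustration (word-level tf-idf would work the same way):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "Canon EOS 5D Mark III",
    "Canon EOS 5D mk III",
    "Canon EF 70-200mm f/2.8L IS II USM Lens",
    "Canon EF 70-200mm f/2.8L IS USM Lens",
]

# Character n-grams (3-5) are tolerant of abbreviations like "mk" vs "Mark"
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), lowercase=True)
tfidf = vectorizer.fit_transform(names)

# Pairwise cosine similarity between all name vectors
sim = cosine_similarity(tfidf)
print(sim.round(2))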

Finally you would have to check the similarity of every pair of documents, in your case about 18,000 of them. The brute-force approach is O(n^2), so this could be infeasible. A common technique to deal with this issue is fingerprinting (hashing) together with Locality-Sensitive Hashing (LSH).
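
To avoid the all-pairs comparison, one common sketch builds a MinHash fingerprint per name and indexes it with LSH so that only colliding fingerprints are compared. The example below assumes the third-party datasketch library and an illustrative similarity threshold of 0.5:

from datasketch import MinHash, MinHashLSH

names = [
    "Canon EOS 5D Mark III",
    "Canon EOS 5D mk III",
    "Canon EF 70-200mm f/2.8L IS II USM Lens",
    "Canon EF 70-200mm f/4L USM Lens",
]

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash fingerprint from lowercased word tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

# Index every name; LSH only surfaces items whose fingerprints collide
lsh = MinHashLSH(threshold=0.5, num_perm=128)
hashes = {name: minhash(name) for name in names}
for name, mh in hashes.items():
    lsh.insert(name, mh)

# Candidate near-duplicates for one name (the query name itself is included)
print(lsh.query(hashes["Canon EOS 5D mk III"]))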

You can find an introduction and a general discussion of the whole topic in Chapter 3 of Mining of Massive Datasets.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange