Question

I have a huge database of product names. Before inserting a new product, I'd like to match it against the database to find out whether it already exists, i.e. get the IDs of entries that are the same product (or very similar) but are written differently, e.g.:

  • iphone 4s
  • i-phone 4s
  • iphone-4s

I don't need to match those entries automatically; I only want to generate matching suggestions and then have a person review them.

I have some ideas about it. Consider a single product name for which I'd like to find the related entry in the database, e.g. "apple iphone-4s". My DB could look like:

  1. iphone-4s
  2. galaxy s4
  3. iphone 3g
  4. apple nano
  5. samsung anything 4s

  1. Replace special chars like "-", "," etc. with a space (apple iphone-4s -> apple iphone 4s), then explode the string into array('apple', 'iphone', '4s'), then loop over each entry in this array, match it against each product name from the database, and count the total number of hits. Results: matching apple iphone 4s <=> array('apple', 'iphone', '4s') to

    • iphone-4s gives 2 hits
    • galaxy s4 gives 0 hits
    • iphone 3g gives 1 hit
    • apple nano gives 1 hit
    • samsung anything 4s gives 1 hit
  2. Sort those matches by number of hits, highest first, i.e. iphone-4s is the most likely match to suggest to the supervisor.

  3. As an addition, it might make sense to remove all spaces and special chars from the names already stored in the database, because of the following scenario: my new product name could be apple iphone and the stored database name could be apple i-phone. Then there would be only one hit instead of two. Removing every non-alphanumeric character from the stored name would probably increase the hit rate. In this example, the stored entry would become appleiphone, so after exploding the new product name apple iphone and searching for each part inside that string, there would be two hits.
  4. As yet another addition, I thought of removing words like colors from all names before matching them, since I don't care about them and I'd like to match two products no matter which color they have.
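
The four steps above could be sketched roughly as follows. This is only an illustration in Python, not PHP; the PRODUCTS dictionary, the COLOR_WORDS list and all function names are made up for the example, and in practice the stored names would come from the database.

```python
import re

# Hypothetical sample data mirroring the example list in the question.
PRODUCTS = {
    1: "iphone-4s",
    2: "galaxy s4",
    3: "iphone 3g",
    4: "apple nano",
    5: "samsung anything 4s",
}

# Hypothetical stop list for step 4; extend it as needed.
COLOR_WORDS = {"black", "white", "red", "blue"}


def tokenize(name):
    """Step 1: lowercase, turn special chars into spaces, split, drop colors."""
    words = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return [w for w in words if w not in COLOR_WORDS]


def squash(name):
    """Step 3: remove every non-alphanumeric character from a stored name."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def count_hits(new_name, stored_name):
    """Count tokens of new_name that appear as whole tokens in stored_name."""
    stored_tokens = set(tokenize(stored_name))
    return sum(1 for token in set(tokenize(new_name)) if token in stored_tokens)


def count_hits_squashed(new_name, stored_name):
    """Step 3 variant: look for each token inside the squashed stored name."""
    squashed = squash(stored_name)
    return sum(1 for token in set(tokenize(new_name)) if token in squashed)


def suggest(new_name, products, scorer=count_hits):
    """Step 2: return (product_id, hits) pairs with at least one hit, best first."""
    scored = [(pid, scorer(new_name, name)) for pid, name in products.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(suggest("apple iphone-4s", PRODUCTS))
    print(count_hits("apple iphone", "apple i-phone"))
    print(count_hits_squashed("apple iphone", "apple i-phone"))
```

Note that the squashed variant is a substring test, so short tokens can produce false hits inside longer, unrelated names. That is acceptable only because a person reviews the suggestions.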

Do you have better ideas?

Solution

You may want to consider the Levenshtein distance function:

http://www.php.net/manual/en/function.levenshtein.php

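Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another, so "iphone 4s" and "i-phone 4s" are only one edit apart. Full-text search engines offer fuzzy matching along these lines to return results similar to the words you type in. I don't know how you can support this in MySQL, but I have used it successfully with Solr indexes. Hope this helps.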
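
To show what the distance measures, here is a minimal sketch in Python (PHP's built-in levenshtein() linked above would do the same job). The closest() helper and its max_distance threshold of 2 are assumptions made for the example, not part of any library.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest(new_name, stored_names, max_distance=2):
    """Return (distance, name) pairs within max_distance, nearest first."""
    scored = [(levenshtein(new_name, name), name) for name in stored_names]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


if __name__ == "__main__":
    print(closest("iphone 4s", ["i-phone 4s", "iphone-4s", "galaxy s4"]))
```

Comparing the new name against every stored name this way scans the whole table, so on a huge database it probably needs a cheaper pre-filter first, or an index that supports fuzzy matching.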

Licensed under: CC-BY-SA with attribution