Question

I have multiple sets of categories from different listing sites (e.g. Yelp, yellowpages.com, Google My Business), and I want to figure out which category on one site corresponds to category X on another.

We have hundreds of thousands of businesses and the categories on all the sites they are on, so we could see that "Galbi Foo Restaurant" is in category "Restaurants > Korean" on one site and "Restaurants" on the other.

Some example category mappings that will have to happen:

  • Nail Salons = Manicure & Pedicure
  • Eyelash Service = Visagist
  • Tanning = Sunbed Salon
  • Specialty Food = Grocery (Specialty Food child node doesn't exist)
  • Diagnostic Imaging = Radiologist

Where would I start to solve this? It seems like a classification (logistic regression) problem. But this ML stuff hasn't clicked with me yet, so I'm likely to drastically over- or under-complicate things :).


Solution

This sounds like a pretty standard supervised learning problem. Each record would be a business that is listed on both site X and site Z. Your predictors would be the tags/categories for that business on site X, and your target variable, y (i.e., what you're trying to predict), would be its actual category on site Z. As far as the code goes, you have a variety of options depending on your preferred language. You could use the caret package in R, the scikit-learn library in Python, or, in Java/Scala, the Weka library (maybe even Spark's MLlib because of its simplicity).
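Before reaching for a library, it may help to see the record layout described above with the simplest possible "model": for each site X category, pick the site Z category that most often co-occurs with it. This is only a baseline sketch in plain Python, not a replacement for a real classifier. The `pairs` data and the `fit_majority_mapping` helper are made up for illustration, and "Beauty Salon" is an invented category used to show a noisy record.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (category on site X, category on site Z)
# for businesses that are listed on both sites. Toy data, not real listings.
pairs = [
    ("Nail Salons", "Manicure & Pedicure"),
    ("Nail Salons", "Manicure & Pedicure"),
    ("Nail Salons", "Beauty Salon"),  # invented noisy record
    ("Tanning", "Sunbed Salon"),
    ("Tanning", "Sunbed Salon"),
    ("Specialty Food", "Grocery"),
    ("Diagnostic Imaging", "Radiologist"),
]


def fit_majority_mapping(pairs):
    """Map each site X category to its most frequent site Z category."""
    counts = defaultdict(Counter)
    for x_cat, z_cat in pairs:
        counts[x_cat][z_cat] += 1
    # most_common(1) returns [(category, count)] for the top category.
    return {x_cat: c.most_common(1)[0][0] for x_cat, c in counts.items()}


mapping = fit_majority_mapping(pairs)
print(mapping["Nail Salons"])
```

A baseline like this also gives you a number to beat: if a fancier model cannot do better than majority voting on held-out businesses, the extra complexity is probably not paying off.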

Side note, in your question I think you meant to say "logistic regression" instead of "logical regression". You don't need to use logistic regression (although it wouldn't hurt). You could also try algorithms like Random Forests or Naive Bayes.
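To make the Naive Bayes option a bit more concrete, here is a minimal sketch in plain Python that uses several site X tags per business as predictors. It assumes add-one (Laplace) smoothing and ignores tags never seen in training. The function names and the toy `records` (including tags such as "Beauty & Spas") are made up for illustration. In practice you would more likely use a library implementation than roll your own.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(records):
    """records: list of (set of site X tags, site Z category)."""
    class_counts = Counter()          # how often each site Z category appears
    tag_counts = defaultdict(Counter) # per site Z category, counts of site X tags
    vocab = set()                     # all site X tags seen in training
    for tags, target in records:
        class_counts[target] += 1
        for tag in tags:
            tag_counts[target][tag] += 1
            vocab.add(tag)
    return class_counts, tag_counts, vocab


def predict(model, tags, alpha=1.0):
    """Return the site Z category with the highest log posterior score."""
    class_counts, tag_counts, vocab = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for target, n in class_counts.items():
        score = math.log(n / total)  # log prior of the category
        denom = sum(tag_counts[target].values()) + alpha * len(vocab)
        for tag in tags:
            if tag in vocab:  # skip tags never seen during training
                score += math.log((tag_counts[target][tag] + alpha) / denom)
        if score > best_score:
            best_class, best_score = target, score
    return best_class


# Toy data: tags on site X, category on site Z (hypothetical).
records = [
    ({"Restaurants", "Korean"}, "Restaurants"),
    ({"Restaurants", "Barbeque"}, "Restaurants"),
    ({"Nail Salons", "Beauty & Spas"}, "Manicure & Pedicure"),
    ({"Nail Salons"}, "Manicure & Pedicure"),
    ({"Tanning", "Beauty & Spas"}, "Sunbed Salon"),
    ({"Specialty Food", "Food"}, "Grocery"),
]

model = train_naive_bayes(records)
prediction = predict(model, {"Nail Salons", "Beauty & Spas"})
print(prediction)
```

The point of the sketch is only that each site Z category gets a score from the tags a business carries on site X, and the highest score wins. Logistic regression or a Random Forest would slot into the same train/predict shape.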

Also worth noting: your target variable will have many classes (i.e., every possible category on the site you're trying to predict), so don't be alarmed if it seems like there are a lot of classes. That's normal for a problem like the one you've described.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange