Question

Our site has user-generated content, and users can add hashtags to categorize their content. To make searching for content easier, we are thinking about creating "Interest" categories like:

Sex, Hobbies, Current Events, etc.  

One way to achieve this would be to associate keywords with each interest category. So, if a user clicks on Hobbies, the system will search for the keywords we've associated with Hobbies like:

Hobbies -> cars, cooking, reading, etc.  

However, this method seems limited: a user can post a picture of a hotrod with the word "sexy" in the body, and in our system the word "sexy" is associated with two interest categories, "Sex" and "Fashion & Beauty".

Any suggestions on how to make this method smarter? Or any advice on how companies typically implement something like this?


Solution

You should probably weight the categories. Find all the matching words in a post, and score each category as follows:

  • Add 3 for every word that belongs unambiguously to that category
  • Add 1 for every word that may belong to several categories

The weighting is deliberately biased towards unique words, so that you can better decide which category a picture belongs to.
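The scoring rule above could be sketched in Python roughly as follows. The keyword sets, the constant names, and the helper functions `score_categories` and `best_category` are all made up for illustration; only the category names and the 3/1 weights come from the discussion above.

```python
from collections import defaultdict

# Hypothetical keyword lists: category names are from the question,
# the keywords themselves are illustrative assumptions.
CATEGORY_KEYWORDS = {
    "Hobbies": {"cars", "hotrod", "cooking", "reading"},
    "Sex": {"sexy"},
    "Fashion & Beauty": {"sexy", "makeup"},
}

UNIQUE_WEIGHT = 3  # word is listed under exactly one category
SHARED_WEIGHT = 1  # word is listed under several categories


def score_categories(words, category_keywords=CATEGORY_KEYWORDS):
    """Return {category: score} for the words of one post."""
    # Invert the mapping: word -> set of categories that list it.
    word_to_categories = defaultdict(set)
    for category, keywords in category_keywords.items():
        for keyword in keywords:
            word_to_categories[keyword].add(category)

    scores = defaultdict(int)
    for word in words:
        categories = word_to_categories.get(word.lower(), set())
        weight = UNIQUE_WEIGHT if len(categories) == 1 else SHARED_WEIGHT
        for category in categories:
            scores[category] += weight
    return dict(scores)


def best_category(words):
    """Return the highest-scoring category, or None if nothing matched."""
    scores = score_categories(words)
    return max(scores, key=scores.get) if scores else None
```

With these assumed keyword lists, a post containing "hotrod" and "sexy" gives Hobbies 3 points and the two other categories 1 point each, so the hotrod picture lands in Hobbies.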

You can also build a continuously updated weight matrix that records how relevant each word is to each category. Frequent words should carry less weight, because everybody uses them.
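One possible way to build such a matrix, assuming you already have some posts with a known category, is sketched below. The formula (share of the word's posts that fall in the category, multiplied by a logarithmic rarity term similar to the inverse document frequency used in TF-IDF) is my own choice for illustration, not something prescribed above, and `build_weight_matrix` is a hypothetical helper name.

```python
import math
from collections import Counter, defaultdict


def build_weight_matrix(labeled_posts):
    """labeled_posts: list of (category, list_of_words) pairs.

    Returns {word: {category: weight}}.
    """
    total_posts = len(labeled_posts)
    doc_freq = Counter()                  # number of posts containing each word
    word_cat_freq = defaultdict(Counter)  # word -> category -> number of posts

    for category, words in labeled_posts:
        # Count each word once per post, ignoring case.
        for word in set(w.lower() for w in words):
            doc_freq[word] += 1
            word_cat_freq[word][category] += 1

    weights = {}
    for word, per_category in word_cat_freq.items():
        # Rarity term: a word that appears in every post gets weight 0.
        rarity = math.log(total_posts / doc_freq[word])
        weights[word] = {
            category: (count / doc_freq[word]) * rarity
            for category, count in per_category.items()
        }
    return weights
```

To keep the matrix "continuously changing", you could simply rebuild it on a schedule as newly categorized posts come in.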

Also, based on the texts that have already been categorized, you can extend the word list automatically and assign the new words to categories. For example, if a new game name (call it 'abc') starts appearing, you will notice that 'abc' shows up in a lot of texts in the Hobbies category and nowhere else. So you can tie this word to that category.
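A minimal sketch of that idea: count, for every word not yet in the keyword list, how many categorized posts of each category contain it, and suggest the word when it is both common enough and concentrated in a single category. The function name `suggest_keywords` and the thresholds `min_posts` and `min_share` are assumptions you would need to tune.

```python
from collections import Counter, defaultdict


def suggest_keywords(categorized_posts, known_words, min_posts=5, min_share=0.9):
    """categorized_posts: list of (category, list_of_words) pairs.

    Returns {new_word: category} for unknown words that occur in at least
    min_posts posts, with at least min_share of them in one category.
    """
    word_cat_freq = defaultdict(Counter)  # word -> category -> number of posts
    for category, words in categorized_posts:
        for word in set(w.lower() for w in words):
            if word not in known_words:
                word_cat_freq[word][category] += 1

    suggestions = {}
    for word, per_category in word_cat_freq.items():
        total = sum(per_category.values())
        category, count = per_category.most_common(1)[0]
        if total >= min_posts and count / total >= min_share:
            suggestions[word] = category
    return suggestions
```

In practice you would probably want a person to review the suggestions before they go live, since an early mistake would feed back into later categorizations.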

Building self-learning systems like this is a very exciting area!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow