Question

I want to understand a customer's intent from their search queries. For example, a customer interested in yoga pants might search for "yoga pants", "exercise pants", "workout tights", and so on. Is there a model I can use to find all the search keywords related to yoga pants?


Solution

I think these are the methods you can try (please feel free to add more to this list):

  1. Use a dictionary that covers almost all possibilities. This is highly precise with somewhat lower recall; it takes manual effort, but should be worth it.
  2. Use pretrained Word2Vec. Mikolov has already trained a model on text data and published the resulting word vectors. In this vector space you can measure which words are similar, and experiment to find a similarity threshold above which you treat two words as related (for example, "yoga" and "exercise" should have decent similarity).
  3. Train a custom Word2Vec model if you have enough data. This is an unsupervised model, so you don't need to tag the data, but you do need a large amount of text relevant to your domain.
  4. Use an RNN to find the most similar words in a corpus and apply it to queries. This gives a bit more flexibility than Word2Vec.
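
As a rough illustration of the similarity threshold mentioned in options 2 and 3, here is a minimal sketch in plain Python. The three-dimensional vectors are invented for this example (real Word2Vec vectors are learned from text and have hundreds of dimensions), and the names `TOY_VECTORS`, `cosine`, and `related_words` as well as the 0.8 threshold are assumptions, not part of any library.

```python
import math

# Hypothetical toy vectors, made up for illustration only.
# In practice these would come from a pretrained or custom Word2Vec model.
TOY_VECTORS = {
    "yoga":     [0.9, 0.8, 0.1],
    "exercise": [0.8, 0.9, 0.2],
    "workout":  [0.7, 0.9, 0.3],
    "pants":    [0.1, 0.2, 0.9],
    "tights":   [0.2, 0.3, 0.8],
    "laptop":   [-0.8, 0.1, -0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_words(word, vectors, threshold=0.8):
    """Return (word, similarity) pairs at or above the threshold, best first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for other, score in related_words("yoga", TOY_VECTORS):
    print(f"{other}: {score:.3f}")
```

With these toy vectors only "exercise" and "workout" clear the threshold for "yoga". On real vectors the threshold has to be tuned by inspecting results, as the answer suggests.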

OTHER TIPS

What you are looking for is named entity recognition (NER). I must tell you this is an extremely tough problem. Stanford has open-sourced an NER tagger, but you need to train it on a large amount of data, and you will have to create a tagged dataset. Look into this medium blog for a good tutorial. I would suggest that you review your requirements: if they are fairly limited, you don't need something like this and can work with a simple vocabulary.

You could look into Rocchio's algorithm, word2vec and other methods that use co-occurrence.
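
For orientation, Rocchio's algorithm moves the query vector toward documents judged relevant and away from those judged non-relevant. The following is a minimal sketch, assuming queries and documents are represented as dictionaries mapping a term to a weight. The function name `rocchio`, the toy data, and the default weights (`alpha=1.0`, `beta=0.75`, `gamma=0.15`, values often quoted as starting points) are assumptions for illustration, and dropping non-positive weights is a common convention rather than a required part of the method.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated term -> weight query using the Rocchio update.

    query:        dict of term -> weight for the original query
    relevant:     list of dicts (documents judged relevant)
    non_relevant: list of dicts (documents judged non-relevant)
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)

    updated = {}
    for term in terms:
        # Centroid weight of the term in each document group (0 if group is empty).
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # assumption: drop terms whose weight is not positive
            updated[term] = weight
    return updated


# Hypothetical toy data: binary term weights.
query = {"yoga": 1.0, "pants": 1.0}
relevant = [
    {"yoga": 1.0, "pants": 1.0, "leggings": 1.0},
    {"workout": 1.0, "tights": 1.0, "pants": 1.0},
]
non_relevant = [{"yoga": 1.0, "mat": 1.0}]

for term, weight in sorted(rocchio(query, relevant, non_relevant).items()):
    print(f"{term}: {weight:.3f}")
```

In this toy run "leggings", "workout" and "tights" enter the query with small weights, while "mat" is dropped because it appears only in the non-relevant document.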

A simple starting point would be to query a large document collection (your collection, the internet, or a combination of both) with the query at hand, take the most prominent words in the result (for example, by TF-IDF of the result against the collection), and add them to the original query. The general idea is that the significant words in the result are closely related to the query. You can experiment with how much you need the general context (the internet) versus the domain-specific context (your collection), and with how you weight the context (how many of the significant words you add and what their weights will be), but we can discuss that later.
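
The expansion idea in the paragraph above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the six-document collection is invented, retrieval is a naive count of shared query terms standing in for a real search engine, and the names `tokenize`, `expand_query`, `top_docs`, and `top_terms` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection; a real one would be your catalogue or search logs.
DOCS = [
    "yoga pants workout tights",
    "yoga pants leggings workout tights",
    "workout tights for yoga",
    "laptop bags for work",
    "garden chairs for the patio",
    "phone cases for travel",
]


def tokenize(text):
    return text.lower().split()


def expand_query(query, documents, top_docs=3, top_terms=2):
    """Add the highest TF-IDF terms from the top-ranked documents to the query."""
    query_terms = tokenize(query)
    query_set = set(query_terms)
    tokenized = [tokenize(doc) for doc in documents]

    # Document frequency of every term across the whole collection.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Naive retrieval (assumption): rank documents by number of shared query terms.
    ranked = sorted(tokenized, key=lambda t: len(query_set & set(t)), reverse=True)
    hits = [t for t in ranked[:top_docs] if query_set & set(t)]

    # Term frequency inside the retrieved documents only.
    term_freq = Counter()
    for tokens in hits:
        term_freq.update(tokens)

    # Score = frequency in the result * inverse document frequency in the collection.
    n_docs = len(tokenized)
    scores = {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
        if term not in query_set
    }
    # Sort by score, breaking ties alphabetically so the output is deterministic.
    best = sorted(scores, key=lambda term: (-scores[term], term))[:top_terms]
    return query_terms + best


print(expand_query("yoga pants", DOCS))
```

Here `top_docs` and `top_terms` correspond to the knobs mentioned above: how much of the result you trust and how many significant words you add. Weighting the added terms lower than the original ones would be a natural next step.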

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange