Question

How can I model topics in the results returned by a search engine, giving higher weight to documents ranked higher in the result set?

The use case that I am looking at involves extracting the most significant topics returned in the search results.

E.g., if the user searches for a query q1 which returns documents D1...Dn with scores S1...Sn (in descending order), then I propose that the theme of such a set of documents is better represented by the documents scored higher in the result set.

Is it possible to incorporate this information into topic modelling algorithms like LDA?


Solution

If you assume that the documents returned for a query already share a common "topic", then what's the point of using topic modeling? You already have a kind of "topical subset" of the documents.

However, if the goal is to clean up the results so that documents which are "topically similar" don't appear together at the top, i.e. to favor "topical diversity" among the top-ranked documents, then this could be done in the following way:

  • Train a topic model on all the documents offline. This gives a vector of posterior probabilities p(T|D) over the topics T for every document D.
  • For every query, modify the scoring method so that, for instance, documents ranked at the top are separated by a minimal topic distance (this part might be tricky to get right, but it's feasible).
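
The second step could be sketched roughly as follows. This is only one possible reading of "minimal topic distance", not a definitive method: it assumes each result already carries its precomputed p(T|D) vector, uses total variation distance as the topic distance, and greedily keeps a document in the top k only if it is far enough from every document already kept. The names `topic_distance`, `diversify`, `min_dist` and `k`, as well as the toy scores and topic vectors, are made up for illustration.

```python
from typing import List, Sequence, Tuple

# (doc_id, search score, precomputed topic vector p(T|D)) -- hypothetical layout
Result = Tuple[str, float, Sequence[float]]

def topic_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two topic distributions
    (0 = identical, 1 = no shared topic mass). An assumed choice of metric."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def diversify(results: List[Result], min_dist: float = 0.3, k: int = 3) -> List[str]:
    """Greedy re-ranking: walk the results by descending score and keep a
    document in the top k only if it is at least min_dist away from every
    document already kept. Documents that are not kept follow afterwards,
    still in score order."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    picked: List[Result] = []
    skipped: List[Result] = []
    for doc in ranked:
        far_enough = all(topic_distance(doc[2], p[2]) >= min_dist for p in picked)
        if len(picked) < k and far_enough:
            picked.append(doc)
        else:
            skipped.append(doc)
    return [d[0] for d in picked + skipped]

# Toy data: D1 and D2 are nearly identical in topic space.
results = [
    ("D1", 0.90, [0.80, 0.10, 0.10]),
    ("D2", 0.85, [0.75, 0.15, 0.10]),
    ("D3", 0.70, [0.10, 0.80, 0.10]),
    ("D4", 0.60, [0.10, 0.10, 0.80]),
]
print(diversify(results))  # ['D1', 'D3', 'D4', 'D2']
```

Here D2 is pushed below D3 and D4 because its topic vector is almost the same as D1's. The threshold `min_dist` controls how aggressive the diversification is and would need tuning on real data.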

Incidentally, this approach is also more realistic because it would be quite inefficient to train a full topic model for every particular query.

Licensed under: CC-BY-SA with attribution