Pergunta

Basically, the idea is to have users following tags on the site, so each users has a set of tags they are following. And then there is a document collection where each document in the collection has a Title, Description, and a set of tags which are of relevance to the topic being discussed in the document as determined by the author. What is the best way of recommending documents to the user, given the information we have, which would also take into consideration the semantic relevance of the title and description of a document to the user's tags, whether that be a word embeddings solution or a tf-idf solution, or a mix, do tell. I still don't know what I'm going to do about tag synonyms, it might have to be a collaborative effort like on stackoverflow, but if there is a solution to this or a pseudo-solution, and I'm writing this in C# using the Lucene.NET library, if that is of any relevance to you.

Foi útil?

Solução

If I got your problem description correct, you are looking for a recommender system like for example used by Netflix or Amazon. State of the art solution would be to use Latent Dirichlet Allocation topic modeling to make recommendations based on topics (in your case, topics would be the tags). Here is a very good video tutorial on this topic: https://youtu.be/3mHy4OSyRf0

In the case of the standard version of LDA, you don't even have to define the tags, you just define a value of different tags among all your documents. If you have for example 10000 documents and you want to use 100 different tags, the method will transform your words/documents matrix into a topics/documents matrix.

The entries of the words/documents matrix are simply all documents as columns and all words (from all your documents) as rows, then for each document you have the counts of each word.

The entries of the topics/documents matrix are all documents as columns and all possible topics as rows, then for each document you have entries like 78% topic1, 12.5% topic95, 0% topic99 on each topic.

Once you have this data and you want to recommend a new document to a user based on his interests(tags), or in other words you have a user_interests vector $\vec{u}$ with 100 entries which have values between 0 and 1, and you have topics/documents matrix $M_{{topics}\times{documents}}$ you calculate a new matrix by multiplaying $M_{{topics}\times{documents}}*\vec{u}$, from this matrix you calculate the sum from each row and recommend those documents with the highest sum.

If you just want to use predefined tags, you can skip the step where you use the LDA method to calculate the topics/documents matrix and simply use your data to represent your documents as tags/documents matrix and your users as tag_vectors, proceeding from here the same as above: multiplying the matrix with a user_vector, calculating the sum from each row and recommending the documents with the highest sum.

Outras dicas

One solution could be to train a single embedding space, StarSpace is one such implementation. That single embedding space would contain all users, documents, and tags. Then it is a nearest neighbor search to recommend any combination. Given a user, find the nearest documents. Given a tag, find the nearest tags …

For new entities (i.e., users, documents, or tags), split the individual entity into parts. For example, a document will have tokens or tag will be associated with documents that have tokens. Then find the average embedding of all the parts. That location in the embedding space is the approximately semantic meaning of new entities.

Overall, this is a complex, open-end problem, thus there are many possible ways to create useful solutions. The next best solutions depend on what has already been done and what the next feature that would add the most value to end-users.

Licenciado em: CC-BY-SA com atribuição
scroll top