Question

I'm planning on building a basic content-based recommender system with word2vec and cosine similarity. The data consists of 300k documents of varying length.

How do I evaluate my model if I have no labels / categories whatsoever?


Solution

If you're trying to create a content-based document recommender system, you want to measure success via some sort of ranking metric like precision@k.

But since you don't have user-document interaction histories, you're either going to have to make them yourself, or just do a bunch of document queries and see if they make sense.

If you're going to make user-document interaction histories yourself, I would just do 10-20 queries, go through the first 5 documents returned for each, and label whether or not each one is relevant. Calculate precision@k over those results and you have an idea of how you're doing.
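A minimal sketch of that calculation, assuming you've recorded your hand labels in a plain dict (the `judgments` structure and doc ids below are hypothetical placeholders, not part of the original answer):

```python
def precision_at_k(labels, k=5):
    """Fraction of the first k recommendations marked as relevant (1)."""
    top_k = labels[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0

# 1 = relevant, 0 = not relevant, judged by hand for each query document
judgments = {
    "doc_017": [1, 1, 0, 1, 0],
    "doc_231": [0, 1, 1, 0, 0],
    "doc_404": [1, 0, 0, 0, 1],
}

mean_p_at_5 = sum(precision_at_k(v, k=5) for v in judgments.values()) / len(judgments)
print(f"mean precision@5 over {len(judgments)} queries: {mean_p_at_5:.2f}")
```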

Not sure if you're familiar with ranking metrics, but the best way to interpret them is to always compare against some baseline model. In your case, I would calculate precision@k for BoW, TF-IDF, LSA, and LDA (each with cosine similarity) as baselines to compare against.
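As a rough sketch of one of those baselines, here is TF-IDF + cosine similarity with scikit-learn; the toy `documents` list stands in for your 300k texts, and swapping in `CountVectorizer` (BoW) or adding `TruncatedSVD` on top of the TF-IDF matrix (LSA) follows the same pattern:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "machine learning for text classification",
    "deep learning with word embeddings",
    "cooking recipes for the weekend",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)      # sparse (n_docs, n_terms) matrix

query_idx = 0
scores = cosine_similarity(tfidf[query_idx], tfidf).ravel()
ranking = scores.argsort()[::-1]                 # most similar documents first

print("documents ranked against document 0:", ranking.tolist())
```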

Unfortunately there aren't a ton of other options for content recommendation without interaction data to test on. But I would also add that simply eyeballing the results will, a lot of the time, tell you how the model is doing.

Other tips

When you don't have labels/categories, the setting is called unsupervised learning. You can approach this problem with a Latent Dirichlet Allocation (LDA) model and then evaluate it by splitting each text in half and comparing the topic assignments of the two halves using cosine similarity. The more similar the topic assignments, the better the model.

Example
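A minimal sketch of that split-half check, assuming gensim's `LdaModel` (the tiny corpus, tokenisation, and parameter choices below are illustrative, not from the original tip):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

documents = [
    "machine learning models need careful evaluation",
    "topic models describe documents as mixtures of topics",
    "weekend cooking recipes with fresh vegetables",
]

tokenized_docs = [doc.lower().split() for doc in documents]
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
               passes=5, random_state=0)

def topic_vector(tokens):
    """Dense topic-probability vector for a list of tokens."""
    bow = dictionary.doc2bow(tokens)
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Split every document in half and compare the topic assignments of the halves.
similarities = []
for tokens in tokenized_docs:
    mid = len(tokens) // 2
    first, second = tokens[:mid], tokens[mid:]
    if first and second:
        similarities.append(cosine(topic_vector(first), topic_vector(second)))

print(f"mean split-half topic similarity: {np.mean(similarities):.3f}")
```

A higher mean similarity suggests the model assigns stable topics to a document regardless of which half it sees.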

Licensed under: CC-BY-SA with attribution