Question

I want to know what can be used to determine how relevant a page is to a theme such as games or movies.

Is there research in this area, or is the only approach to count how many times certain relevant words appear?

Solution

The common choice is supervised document classification on bag-of-words (or bag-of-n-grams) features, preferably with tf-idf weighting.
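To make the feature side concrete, here is a minimal sketch of tf-idf weighting over a bag of words, using only the Python standard library. The function names, the toy documents, and the specific weighting variant (raw term count times log(N / document frequency)) are assumptions chosen for illustration; real libraries typically use smoothed or normalized variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy documents.
docs = [
    "the game has a new multiplayer level",
    "the movie has a famous director",
    "the game trailer looks like a movie",
]
vectors = tfidf_vectors(docs)
```

With this weighting, a word that occurs in every document (such as "the") gets weight zero, while a word confined to one document (such as "multiplayer") gets the highest weight, which is why tf-idf is usually preferred over plain word counting.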

Popular algorithms include Naive Bayes and (linear) SVMs.

For this approach, you'll need labeled training data, i.e. documents annotated with relevant themes.
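As a rough illustration of the supervised approach, the following is a small multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It works on raw word counts rather than tf-idf weights to keep the sketch short. The class name, the training texts, and the theme labels are all hypothetical; in practice you would more likely use an existing machine learning library and far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesThemeClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        # Words never seen during training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / self.total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical labeled training data: documents annotated with themes.
train_docs = [
    "new multiplayer game with online levels",
    "console game review and gameplay tips",
    "the director released a new movie trailer",
    "movie review of the actor and the film plot",
]
train_labels = ["games", "games", "movies", "movies"]

classifier = NaiveBayesThemeClassifier().fit(train_docs, train_labels)
prediction = classifier.predict("gameplay tips for the multiplayer levels")
print(prediction)
```

On this toy data the example page is assigned to "games". The per-theme log scores returned by log_scores can also be compared directly if you want a relevance ranking rather than a single label.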

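See, e.g., Introduction to Information Retrieval (Manning, Raghavan and Schütze), chapters 13-15, which cover text classification with Naive Bayes, vector space classification, and support vector machines.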

Licensed under: CC-BY-SA with attribution