Classifying text documents using linear/incremental topics
16-10-2019
Question
I'm trying to classify text documents along a few different dimensions. I want to define arbitrary topics, such as size and relevance, whose values are linear or gradual in nature. For example:
size: tiny, small, medium, large, huge
relevance: bad, ok, good, excellent, awesome
I am labelling the training documents by hand: for example, this document represents a 'small' thing, and this other document discusses a 'large' thing. When I try a multi-label or multi-class SVM for this, it does not work well, and it doesn't make sense logically either.
Which model should I use to predict this kind of linear data? I currently use scikit-learn with a TF-IDF vector of the words.
Solution
If you want these output dimensions to be continuous, convert your size and relevance labels to real-valued targets (for example, tiny = 0 up to huge = 4). You can then perform regression instead of classification, using any of a variety of models. You could even train a multi-target neural net to predict all of these outputs at once.
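As a rough sketch of the regression idea, the snippet below maps each ordered label to an integer rank, fits one regressor on TF-IDF features for both targets, and rounds the predictions back to the nearest label. The training texts, their labels, the equal spacing between levels, and the choice of Ridge (used here as a simple stand-in for a multi-target neural net) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Ordered label sets taken from the question.
SIZE = ["tiny", "small", "medium", "large", "huge"]
RELEVANCE = ["bad", "ok", "good", "excellent", "awesome"]

# Hypothetical hand-labelled data: (text, size label, relevance label).
labelled = [
    ("tiny speck of dust", "tiny", "bad"),
    ("small pebble in a shoe", "small", "ok"),
    ("medium sized family car", "medium", "good"),
    ("large cargo truck", "large", "excellent"),
    ("huge container ship", "huge", "awesome"),
    ("tiny grain of sand", "tiny", "ok"),
    ("huge mountain range", "huge", "good"),
]

texts = [text for text, _, _ in labelled]
# Assumption: adjacent levels are equally spaced, so rank = list index.
Y = np.array(
    [[SIZE.index(s), RELEVANCE.index(r)] for _, s, r in labelled],
    dtype=float,
)

# Ridge accepts a 2-D target and fits one linear model per column.
model = make_pipeline(TfidfVectorizer(stop_words="english"), Ridge(alpha=1.0))
model.fit(texts, Y)


def predict_labels(new_texts):
    """Return (size, relevance) labels plus the raw continuous scores."""
    scores = model.predict(new_texts)  # shape: (n_texts, 2)
    ranks = np.clip(np.rint(scores), 0, len(SIZE) - 1).astype(int)
    return [(SIZE[s], RELEVANCE[r]) for s, r in ranks], scores


labels, scores = predict_labels(["a huge ship", "a tiny speck"])
```

Keeping the raw scores alongside the rounded labels lets you see how close a document sits to the boundary between two adjacent levels, which a plain multi-class SVM would not give you.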
Additionally, you might consider first fitting a topic model such as LDA and using its topic distributions as your feature space.
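A minimal sketch of the LDA suggestion, assuming scikit-learn's LatentDirichletAllocation: each document is turned into a distribution over topics, and the regressor is trained on those topic weights instead of on the TF-IDF vector. The toy texts, the size ranks, and the choice of three topics are made up for illustration. LDA is normally fitted on raw word counts, so the sketch uses CountVectorizer rather than TF-IDF.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Hypothetical documents with size ranks (0 = tiny ... 4 = huge).
texts = [
    "tiny speck of dust",
    "small pebble in a shoe",
    "medium sized family car",
    "large cargo truck",
    "huge container ship",
    "tiny grain of sand",
    "huge mountain range",
]
size_rank = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 4.0])

# Word counts are the usual input for LDA.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)

# n_components=3 is an arbitrary choice for this toy corpus.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_features = lda.fit_transform(counts)  # one topic distribution per doc

# Regress the size rank on the topic weights.
regressor = Ridge(alpha=1.0).fit(topic_features, size_rank)

# New documents go through the same vectorizer and topic model.
new_topics = lda.transform(vectorizer.transform(["a huge ship"]))
new_score = regressor.predict(new_topics)
```

On a corpus this small the topics are not meaningful; the point is only the shape of the pipeline. With real data, the number of topics would need tuning.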
Judging by its values (bad through awesome), "relevance" sounds like a variable best captured by techniques from sentiment analysis.