Question

I'm attempting to classify text documents along a few different dimensions. I'm trying to create arbitrary topics to classify on, such as size and relevance, which are linear or gradual in nature. For example:

size: tiny, small, medium, large, huge
relevance: bad, ok, good, excellent, awesome

I am labelling the training data by hand. For example, this document represents a 'small' thing, and this other document discusses a 'large' thing. When I try a multi-label or multi-class SVM for this, it does not work well, and it also doesn't make sense logically, since those models treat the labels as unordered.

Which model should I use to predict this linear type of data? I currently use scikit-learn with a TF-IDF vector of the words.


Solution

If you want these output dimensions to be continuous, simply convert your size and relevance labels to real-valued targets (for example, tiny = 0 up to huge = 4). Then you can perform regression instead of classification, using any of a variety of models. You could even attempt to train a multi-target neural net to predict all of these outputs at once.
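
As a rough sketch of that idea, the snippet below maps the ordered size labels to integers and fits a regressor on TF-IDF features with scikit-learn. The example documents, the equal spacing between labels, and the choice of `Ridge` are all assumptions made for illustration; any regressor could be swapped in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Ordered labels mapped to 0..4. Equal spacing between levels is an assumption.
SIZE_LEVELS = ["tiny", "small", "medium", "large", "huge"]
SIZE_TO_SCORE = {label: i for i, label in enumerate(SIZE_LEVELS)}

# Hypothetical hand-labelled documents; replace with real training data.
docs = [
    "a tiny speck of dust",
    "a small pebble on the path",
    "a medium sized dog",
    "a large truck on the road",
    "a huge mountain range",
]
labels = ["tiny", "small", "medium", "large", "huge"]
y = [SIZE_TO_SCORE[label] for label in labels]

# TF-IDF features feeding a linear regressor (Ridge is one arbitrary choice).
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(docs, y)


def predict_size(text):
    """Return the raw regression score and the nearest ordered label."""
    score = float(model.predict([text])[0])
    # Clamp to the valid range before snapping to a label.
    index = min(max(int(round(score)), 0), len(SIZE_LEVELS) - 1)
    return score, SIZE_LEVELS[index]


score, label = predict_size("a large truck")
print(score, label)
```

The raw score keeps the ordering information, so a prediction that lands between two levels is a smaller error than one at the wrong end of the scale. The same pattern would apply to relevance with its own label list.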

Additionally, you might consider first using a topic model such as LDA as your feature space.
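
One way this could look in scikit-learn is sketched below: word counts go through `LatentDirichletAllocation`, and the resulting per-document topic proportions become the regression features. The documents, the targets, the number of topics, and the use of `Ridge` are placeholder assumptions, not tuned choices.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical documents with numeric size targets (0 = tiny ... 4 = huge).
docs = [
    "a tiny speck of dust",
    "a small pebble on the path",
    "a medium sized dog",
    "a large truck on the road",
    "a huge mountain range",
]
y = [0, 1, 2, 3, 4]

# LDA works on raw term counts rather than TF-IDF weights.
# n_components=3 is an arbitrary choice for this toy corpus.
topic_features = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=3, random_state=0),
)

# Each row is a document's distribution over the 3 topics.
X_topics = topic_features.fit_transform(docs)

# Fit a regressor on the topic proportions instead of on word features.
regressor = Ridge(alpha=1.0)
regressor.fit(X_topics, y)

# New text goes through the same fitted vectorizer and topic model.
new_topics = topic_features.transform(["a large dog on the road"])
prediction = regressor.predict(new_topics)
print(X_topics.shape, prediction)
```

With a corpus this small the topics are not meaningful; the point is only the shape of the pipeline. On real data the number of topics would need to be chosen by validation.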

Based on the values, it sounds like the "relevance" might be a variable best captured by techniques from sentiment analysis.

Licensed under: CC-BY-SA with attribution