Question

Apologies if this is naive; I am fairly new to the domain. I need to classify text data into 2 classes. I get acceptable results by computing word vectors, applying dimensionality reduction, and passing the result to LinearSVC for classification. However, my model is biased towards longer sentences. I know tf-idf can help with this, but is there a way to apply it together with word vectors? I don't want to lose the model's ability to predict on unseen but similar data.

A follow-up problem arises when the data comes from some class other than the 2 classes I trained on. I would like my model to predict that such data doesn't belong to either of the 2 classes. Currently, it just predicts one of the 2 classes at random.

I was wondering whether I should first run my data through topic modelling, which would give an idea of what topic the text is about. Based on the keywords from topic modelling, I could detect whether the text belongs to one of the 2 classes, and only then pass it to my classification model for the final prediction. But this doesn't seem very clean, and I can see it failing because it depends too heavily on the keywords generated by topic modelling. Is there a better way to do this?

Solution

This is called an open-class text classification problem; it is used in particular for some author identification problems. I don't have any recent pointers, but from a quick search I found this article: https://www.aclweb.org/anthology/N16-1061.pdf

In the field of author classification there is a similar problem called author verification, which can be treated as a one-class classification problem. You could apply the same idea here in two stages:

  1. one-class classification between "known classes" vs. others
  2. regular classification between the known classes
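
The two steps above could be sketched roughly as follows with scikit-learn. This is only an illustration under several assumptions: tf-idf features stand in for the word-vector pipeline from the question, `OneClassSVM` is one possible one-class model (the answer does not name one), and the class name `TwoStageClassifier`, the `"other"` label, the `nu` value and the toy sentences are all made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC, OneClassSVM


class TwoStageClassifier:
    """Hypothetical wrapper: a one-class gate followed by a regular classifier."""

    def __init__(self, nu=0.1):
        # Assumption: tf-idf features; swap in your own word-vector features.
        self.vectorizer = TfidfVectorizer()
        # Step 1: one-class model fitted on all known-class data together.
        self.gate = OneClassSVM(kernel="linear", nu=nu)
        # Step 2: ordinary classifier between the known classes.
        self.clf = LinearSVC()

    def fit(self, texts, labels):
        X = self.vectorizer.fit_transform(texts)
        self.gate.fit(X)         # the gate ignores labels: everything here is "known"
        self.clf.fit(X, labels)  # the classifier separates the known classes
        return self

    def predict(self, texts, unknown_label="other"):
        X = self.vectorizer.transform(texts)
        # OneClassSVM.predict returns +1 for inliers and -1 for outliers.
        known_idx = np.flatnonzero(self.gate.predict(X) == 1)
        out = np.full(X.shape[0], unknown_label, dtype=object)
        if known_idx.size > 0:
            out[known_idx] = self.clf.predict(X[known_idx])
        return out


# Toy data, invented for illustration only.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "simmer the sauce and add chopped garlic",
    "bake the bread in a hot oven",
    "whisk the eggs with sugar and flour",
]
train_labels = ["sports", "sports", "sports", "cooking", "cooking", "cooking"]

model = TwoStageClassifier().fit(train_texts, train_labels)
preds = model.predict(["the players lost the game", "the stock market fell sharply"])
print(preds)
```

How well the gate rejects unseen classes depends heavily on the features and on `nu`, so the threshold would need tuning on held-out data that includes some out-of-class examples.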

OTHER TIPS

Apart from your two target classes, relabel all other data as a third class and then train your model on the resulting three-class classification problem.
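
A minimal sketch of this three-class setup, assuming you can collect some examples that belong to neither target class. The tf-idf features, the `"other"` label and the toy sentences are assumptions for illustration, not part of the original suggestion.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration; "other" collects everything outside the two targets.
texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "simmer the sauce and add chopped garlic",
    "bake the bread in a hot oven",
    "the stock market fell sharply today",
    "the senate passed the new budget bill",
]
labels = ["sports", "sports", "cooking", "cooking", "other", "other"]

# Same kind of classifier as in the question, now trained on three labels.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

preds = model.predict(["the goalkeeper saved the match", "the bank raised interest rates"])
print(preds)
```

This approach only rejects the kinds of text that the "other" examples cover, so the third class should be as varied as the out-of-class data you expect to see.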

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange