Question

I'm creating a topic classifier where a document may be tagged with several different topics, say A and B, while it actually belongs to A, B and C. In the training stage I want the classifier to learn that the document belongs to A and B, but since I'm not sure about class C, I don't want it to learn anything about whether the document does or doesn't belong to class C. Any ideas on how to implement such a thing?

I thought about adding weights to the output labels: a low weight means there's no way the document belongs to the topic, a high weight means the document belongs to the topic for sure, and a mid weight means I'm not sure (so the penalty in that case would be lower).

Solution

You almost solved the problem in the last paragraph. Expressed more formally, your cost function could be

$$-\frac{1}{N} \sum_{i,j} c_{i,j} \left[\, y_{i,j} \log x_{i,j} + (1 - y_{i,j}) \log (1 - x_{i,j}) \,\right]$$

where $i$ runs over items/documents and $j$ runs over classes, $x_{i,j}$ is your predicted probability, $y_{i,j}$ is the binary label (1 if item $i$ has class $j$, 0 otherwise), and $0 \le c_{i,j} \le 1$ is your confidence in that label. This is a simple modification of the binary cross entropy: when the confidence $c_{i,j}$ is low, the prediction for that label contributes little to the loss, so the model is not pushed toward either decision for class C.
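A minimal sketch of that loss, assuming TensorFlow 2.x; the function name weighted_bce and the confidence tensor are illustrative, not part of any library API:

    import tensorflow as tf

    def weighted_bce(y_true, y_pred, confidence, eps=1e-7):
        """Confidence-weighted binary cross-entropy for multi-label outputs.

        y_true:     (batch, num_classes) binary labels y_{i,j}
        y_pred:     (batch, num_classes) predicted probabilities x_{i,j} (sigmoid outputs)
        confidence: (batch, num_classes) per-label confidence c_{i,j} in [0, 1];
                    set it near 0 for labels you are unsure about (class C)
        """
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)  # avoid log(0)
        per_label = (y_true * tf.math.log(y_pred)
                     + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return -tf.reduce_mean(confidence * per_label)

With all confidences set to 1 this reduces to the ordinary binary cross entropy; setting the confidence to 0 for an unknown label removes its gradient entirely.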

OTHER TIPS

Multi-label classification can be a tough problem in NLP, and many techniques have been developed for it recently.

The ML-PA-LDA algorithm seems to work well for the multi-label setting. PA stands for presence-absence: the model also accounts for the correlations arising from the absence of a class in a document.

You are right in your last paragraph. You can use TensorFlow for this purpose; it supports having the output be a vector of classes.

The introductory TensorFlow tutorial for the MNIST dataset uses an output vector of dimension 10 (one entry per class 0..9).

https://www.tensorflow.org/tutorials/mnist/beginners/
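For the multi-label case here, the last layer would use a sigmoid (one independent probability per topic) rather than the softmax used in that MNIST tutorial. A minimal sketch, assuming TensorFlow 2.x with Keras; num_features, num_classes and the layer sizes are illustrative:

    import tensorflow as tf

    num_features = 5000   # e.g. size of a bag-of-words vector (illustrative)
    num_classes = 20      # number of topics (illustrative)

    # One independent sigmoid output per topic, so several topics can be active at once.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_features,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

To apply the confidence weighting from the accepted answer, you would replace binary_crossentropy with a custom loss along the lines of the weighted_bce sketch above.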

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange