Data quantity is not low but data quality is low: what are the current best practices?

https://datascience.stackexchange.com/questions/75280

11-12-2020

Question

In a text classification task, if data quantity is low but data quality is not, we can use data augmentation methods to improve the model.

But my situation is the opposite: data quantity is not low, while data quality is low (the labels are noisy, i.e. the accuracy of the training labels is low).

I obtain the low-quality data with unsupervised or rule-based methods. In detail, I am working on a multi-label classification task. First I crawl web pages such as wiki pages and use regex-based rules to assign the labels. The model input is the wiki title, and the model output is the set of rule-matched labels from the wiki content.
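To make the setup concrete, the labeling step described above could look roughly like the sketch below. The label names, the regex patterns, and the helper names `weak_labels` and `build_example` are all hypothetical, chosen only for illustration; the real rules are not given in the question.

```python
import re

# Hypothetical rules: each label maps to a regex searched in the page content.
LABEL_RULES = {
    "physics": re.compile(r"\b(quantum|relativity|particle)\b", re.IGNORECASE),
    "biology": re.compile(r"\b(cell|species|genome)\b", re.IGNORECASE),
    "history": re.compile(r"\b(empire|dynasty|century)\b", re.IGNORECASE),
}


def weak_labels(content: str) -> set[str]:
    """Return every label whose rule matches the page content."""
    return {label for label, pattern in LABEL_RULES.items() if pattern.search(content)}


def build_example(title: str, content: str) -> tuple[str, set[str]]:
    """Pair the model input (title) with the rule-matched labels (noisy target)."""
    return title, weak_labels(content)


example = build_example(
    "Wave function",
    "A quantum mechanical concept formulated in the twentieth century.",
)
print(example)
```

Here the word "century" triggers the hypothetical "history" rule even though the page is about physics, which is the kind of label noise the question is about.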

Solution

If the noise is not too large, a well-regularized model should still perform well, since regularization limits how closely the model can fit individual mislabeled examples.

Ensemble methods could also work well, since they reduce the variance of the model. You could also try an ensemble that includes an unsupervised method such as clustering, to reduce the dependency on the labels.
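As a minimal sketch of the ensemble idea for a multi-label task, the predictions of several models can be combined with a per-label majority vote. The function name `vote`, the `min_share` parameter, and the example predictions are assumptions made for illustration; one of the voters could just as well be derived from a clustering step.

```python
from collections import Counter


def vote(predictions: list[set[str]], min_share: float = 0.5) -> set[str]:
    """Keep a label only if more than `min_share` of the models predicted it."""
    counts = Counter(label for labels in predictions for label in labels)
    return {label for label, n in counts.items() if n / len(predictions) > min_share}


# Hypothetical label sets predicted by three different models for one title.
model_outputs = [
    {"physics", "history"},
    {"physics"},
    {"physics", "biology"},
]
combined = vote(model_outputs)
print(combined)
```

Labels that only one model predicts are dropped, so an error made by a single model has less influence on the final output.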

Otherwise, methods have been developed specifically to handle noisy labels: https://stats.stackexchange.com/questions/218656/classification-with-noisy-labels
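One simple heuristic in this family, not necessarily the one recommended in the linked thread, is to flag labels that a model strongly disagrees with and then review or drop them. The sketch below assumes you already have per-label probabilities for each example, ideally from a model that did not see that example during training (for instance via cross-validation). The function name `suspect_labels` and the thresholds `low` and `high` are arbitrary choices for illustration.

```python
def suspect_labels(
    given: list[set[str]],
    probs: list[dict[str, float]],
    low: float = 0.2,
    high: float = 0.8,
) -> list[tuple[int, str, str]]:
    """Flag (example index, label, reason) where the model disagrees with the rule-based label."""
    suspects = []
    for i, (labels, p) in enumerate(zip(given, probs)):
        for label, score in sorted(p.items()):
            if label in labels and score < low:
                # Label was assigned by the rules, but the model finds it unlikely.
                suspects.append((i, label, "possibly wrong"))
            elif label not in labels and score > high:
                # Label was not assigned, but the model finds it very likely.
                suspects.append((i, label, "possibly missing"))
    return suspects


# Hypothetical rule-based labels and model probabilities for three examples.
given = [{"physics", "history"}, {"biology"}, {"history"}]
probs = [
    {"physics": 0.93, "history": 0.05, "biology": 0.02},
    {"physics": 0.10, "history": 0.08, "biology": 0.40},
    {"physics": 0.05, "history": 0.88, "biology": 0.91},
]
suspects = suspect_labels(given, probs)
print(suspects)
```

The flagged pairs are only candidates: with noisy training data the model's probabilities are themselves imperfect, so a manual check of a sample is advisable before removing anything.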

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange