Question

We are developing a classification system where the categories are fixed, but many of them are interrelated.

For example, we have a category called "roads" and another called "traffic". We believe the model will be confused by text samples that could belong to the roads category as well as to traffic.

Some of our text samples are suitable for multi-label classification too. For example: "There is a garbage dump near the footpath. The footpath is broken completely." This text could be categorized into the garbage bucket or the footpath bucket.

We are going to build a training set for this classifier by manually annotating the text. So, can we put multiple labels on one issue? How should we deal with text that has multiple labels? Should it be added as a training sample to every category it is tagged with?

For example, should the garbage/footpath text above be added as a training sample for both garbage and footpath? How should we handle the labels in that case?

Can you please give your insights?


Solution

Generally, with multiple classes you have to distinguish between exclusive and inclusive groups. The simplest cases are "all classes are mutually exclusive" (predict exactly one class) and "all classes are compatible" (predict the list of classes that apply).

Either way, label the examples as you would want your trained model to predict them. If you expect your classifier to predict that an example is in both garbage and footpath, then label such an example with both. If you want it to disambiguate between them, then label it with the single correct class.
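Concretely, labeling "with both" usually means storing each annotation as a multi-hot vector with one slot per category. A minimal sketch, using a hypothetical category list based on the question's examples:

```python
# Hypothetical fixed category list; names are illustrative only.
CATEGORIES = ["roads", "traffic", "garbage", "footpath"]

def encode_labels(labels, categories=CATEGORIES):
    """Turn a set of label names into a multi-hot vector (one 0/1 slot per category)."""
    return [1 if c in labels else 0 for c in categories]

# The ambiguous example from the question gets both labels:
text = "There is a garbage dump near the footpath. The footpath is broken completely."
target = encode_labels({"garbage", "footpath"})
print(target)  # -> [0, 0, 1, 1]
```

The text then appears once in the training set with this combined target, rather than being duplicated as two single-label examples.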

To train a classifier to predict multiple target classes at once, it is usually just a matter of picking the correct objective function and a classifier with an architecture that supports it.
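The usual objective for this setup is a per-class binary cross-entropy, treating each category as an independent yes/no decision against the multi-hot target. A sketch (the probabilities are made up for illustration):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Sum of per-class binary cross-entropy losses.

    y_true: multi-hot target vector (0/1 per category).
    y_pred: predicted probability per category.
    """
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for t, p in zip(y_true, y_pred)
    )

# Target: garbage and footpath apply; roads and traffic do not.
loss = binary_cross_entropy([0, 0, 1, 1], [0.1, 0.2, 0.9, 0.8])
```

Unlike a categorical cross-entropy over a softmax, this loss does not force the class probabilities to sum to one, so several classes can be "on" at once.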

For example, with a neural network you would avoid a "softmax" output, which is geared towards predicting a single class. Instead you might use a "sigmoid" activation on each output and predict class membership with a simple threshold on each one.
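A minimal sketch of that output layer, assuming you already have one raw score (logit) per category; the scores here are invented:

```python
import math

def sigmoid(z):
    """Squash a raw score into an independent per-class probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_multilabel(logits, threshold=0.5):
    """Every class whose sigmoid probability clears the threshold is predicted."""
    probs = [sigmoid(z) for z in logits]
    return [p >= threshold for p in probs]

# Raw scores for (roads, traffic, garbage, footpath); values are illustrative.
logits = [-2.0, -1.0, 1.5, 2.0]
print(predict_multilabel(logits))  # -> [False, False, True, True]
```

Note that, unlike softmax, the sigmoid probabilities are independent, so zero, one, or several classes can be predicted for the same input; the threshold can also be tuned per class on a validation set.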

You can also get more sophisticated, perhaps with a pipeline model, if your data can be split into several exclusive groups: predict the group in a first stage, and have multiple group-specific models predicting the combination of classes within each group in a second stage. This may be overkill for your problem, although it may still be handy if it keeps your individual models simple (e.g. they could all be logistic regression, and the first stage may gain some accuracy if the groups are easier to separate).
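The two-stage idea can be sketched with stub classifiers; the group names and keyword rules below are purely hypothetical stand-ins for trained models:

```python
def predict_group(text):
    # Stage 1 (stub): route the text to one exclusive group.
    if "footpath" in text or "roads" in text:
        return "infrastructure"
    return "sanitation"

GROUP_MODELS = {
    # Stage 2 (stubs): one multi-label model per group.
    "infrastructure": lambda text: [c for c in ("roads", "footpath") if c in text],
    "sanitation": lambda text: [c for c in ("garbage", "drainage") if c in text],
}

def predict(text):
    """Route to a group, then run that group's multi-label model."""
    group = predict_group(text)
    return GROUP_MODELS[group](text)

print(predict("the footpath near the roads is broken"))  # -> ['roads', 'footpath']
```

In practice each stub would be replaced by a trained classifier; the point is only that the second stage never has to discriminate between classes that live in different groups.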

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange