Question

Let's say I need to classify addresses with scikit-learn, and I want my classifier to predict both the street name and the post/zip code. Should I use a OneVsRest classifier, or separate the tasks into two different classifiers (trained on the same training set)?

I have tried both, and multiple classifiers seem like the better choice, since training several smaller classifiers feels faster. Is this how it is supposed to be done?


Solution

Both approaches are valid and both are commonly used. Sometimes a classifier that claims to be multilabel is simply splitting the labels across multiple OneVsRest classifiers under the hood and conveniently joining the results together at the end.
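To make the equivalence concrete, here is a minimal sketch with synthetic data (the two targets and all parameter values are illustrative assumptions, not from the question): scikit-learn's `MultiOutputClassifier` fits one clone of the base estimator per target column, which produces the same result as training the two classifiers yourself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-ins for the two address targets (e.g. street bucket, zip bucket)
X, y_street = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)
rng = np.random.RandomState(1)
y_zip = rng.randint(0, 4, size=200)
Y = np.column_stack([y_street, y_zip])

# One wrapper that fits one classifier per target under the hood
multi = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Equivalent: two separate classifiers on the same training set
clf_street = LogisticRegression(max_iter=1000).fit(X, y_street)
clf_zip = LogisticRegression(max_iter=1000).fit(X, y_zip)

pred_multi = multi.predict(X)  # shape (200, 2), one column per target
```

Because the wrapper just clones and refits the same deterministic estimator per column, the joint and the separate predictions coincide here; the wrapper is mostly a convenience.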

However, there are cases where the methods are fundamentally different. For instance, when training a neural network with multiple targets (labels), you can set up the architecture so that part of the network is shared. The shared nodes end up learning features that are useful for all of the targets, which can improve every task at once.
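Staying within scikit-learn, a small illustration of that sharing (synthetic data; the network size and toy targets are assumptions for the sketch): when `MLPClassifier` is fit on a multilabel indicator matrix, the same hidden layer feeds all output units, so the learned hidden features are shared across labels.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Two binary labels that depend on overlapping structure in X
Y = np.column_stack([X[:, 0] > 0, X[:, 0] + X[:, 1] > 0]).astype(int)

# Fitting on an indicator matrix makes this one multilabel network:
# a single shared hidden layer, two sigmoid output units
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X, Y)

pred = net.predict(X)  # shape (100, 2), one column per label
```

Separate per-label classifiers would each learn their own hidden features from scratch; the shared version can reuse what one label teaches it for the other.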

For example, if your classes (labels) are "cat-pet", "cat-big", and "dog", you may want an algorithm that first learns to distinguish any cat from any dog, and then in a later step learns to separate cats that are pets from cats that are big (like a lion!). This is called hierarchical classification, and if your classifier can exploit the hierarchy you may gain accuracy. If your classes are completely independent, however, it may not make any difference.
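That two-stage idea can be wired up by hand. This sketch uses random synthetic features (so no accuracy is implied; it only shows the structure, and the helper name `predict_hier` is made up): stage 1 decides cat vs. dog, and stage 2, trained only on the cat rows, refines the cat subtype.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(300, 4)
labels = np.array(["cat-pet", "cat-big", "dog"])[rng.randint(0, 3, 300)]

is_cat = np.char.startswith(labels, "cat")

# Stage 1: any cat vs. any dog, trained on everything
stage1 = LogisticRegression(max_iter=1000).fit(X, is_cat)
# Stage 2: pet-cat vs. big-cat, trained only on the cat examples
stage2 = LogisticRegression(max_iter=1000).fit(X[is_cat], labels[is_cat])

def predict_hier(x):
    """Route each sample through the hierarchy: dog stops at stage 1,
    cats get a second, finer-grained decision from stage 2."""
    x = np.atleast_2d(x)
    return np.where(stage1.predict(x), stage2.predict(x), "dog")

out = predict_hier(X)
```

Each stage solves an easier sub-problem than a flat three-way classifier would, which is where the potential accuracy gain comes from when the hierarchy is real.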

I suggest you start with the method that is easiest (i.e. OneVsRest), check whether the performance meets your needs, and move to more complicated methods (multilabel, hierarchical methods, etc.) only once you need better accuracy.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange