Question

In general, what sort of supervised algorithms and techniques should I use on data that has the following characteristics:

  • 2 potential classification labels?

  • 3-5 potential classification labels?

  • 6-10 potential classification labels?

  • 10-50 potential classification labels?

  • 50 or more potential classification labels?

My Primary Questions:

  1. What algorithms learn most effectively at these different tiers of total possible class labels?
  2. What algorithms generally make the best predictions with the smallest amount of data at each of these tiers?

I know at some point it makes more sense to use regression rather than a classifier. How many potential class labels would this be?


Solution

There are many factors that influence the choice of classification algorithm. The number of target classes generally has little influence compared to the nature of the input features.

As one example, if your input data is natural audio or images, then regardless of the number of classes, a deep convolutional neural network is very likely to give the best performance.
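For illustration, here is a minimal sketch in PyTorch (layer sizes are arbitrary, not a recommendation) showing that only the final layer of such a network depends on the number of classes:

```python
# Toy CNN classifier: the class count only determines the output layer size.
import torch.nn as nn

def make_image_classifier(num_classes: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, num_classes),  # the only part that changes with class count
    )

# Whether you have 2 or 500 classes, the rest of the architecture is identical.
model = make_image_classifier(num_classes=10)
```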

What algorithms generally make the best predictions with the smallest amount of data at each of these tiers?

There is no a priori best approach based on the number of output classes. "Best predictions" versus "smallest amount of data" is also a trade-off: simpler models will perform better than complex ones on small amounts of data, while more complex models cope better with larger amounts of data and will then give you better predictions. At some point you might have enough data that sampling more will not improve your trained models, but you need to establish that empirically, e.g. with a learning curve as sketched below.
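One way to check this is scikit-learn's learning_curve; a sketch on synthetic data (the dataset and model here are placeholders):

```python
# If the validation score has plateaued as training size grows, collecting
# more data is unlikely to help this model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} samples -> mean CV accuracy {score:.3f}")
```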

Most algorithms let you explore this trade-off by varying hyper-parameters, making the model simpler for smaller data sets and more complex when there is more training data.
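For instance, a decision tree's max_depth is one such complexity knob; a sketch of searching over it with cross-validation (data here is again synthetic):

```python
# Matching model capacity to data size by searching a complexity hyper-parameter.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, 16, None]},  # simpler -> more complex
    cv=5,
)
search.fit(X, y)
print("Best depth at this data size:", search.best_params_)
```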

I know at some point it makes more sense to use regression rather than a classifier. How many potential class labels would this be?

That is not strictly true. In general, the distinction between classification and regression is a hard line. If you are classifying hand-written symbols into an alphabet, for instance, it doesn't really matter whether you are doing this for 10, 100 or 1000 classes; there is no practical point at which the symbols stop being a set of discrete objects and turn into a continuous space to perform regression over.

It could be true if your target class represents a range within some continuous variable (e.g. classifying an event by some of its properties into the year in which it occurred). But in that case the problem is inherently a regression problem to start with. In fact, you may be better off training a regression algorithm in this case even for a small number of target classes, and simply binning the predictions into the relevant classes, as in the sketch below.
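A sketch of that regress-then-bin idea, with entirely synthetic data standing in for the "year of an event" example:

```python
# Train on the raw continuous value, then bin predictions into the classes.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
year = 2005 + 5 * X[:, 0] + rng.normal(scale=2, size=1000)  # continuous target

reg = RandomForestRegressor(random_state=0).fit(X, year)

bin_edges = [2000, 2010]                      # classes: <2000, 2000s, >=2010
pred_class = np.digitize(reg.predict(X), bin_edges)
```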

It could also be true if your target class represents a rank or position within an ordered set, in which case the problem does look more like regression when the sequence is long. In general, if you can arrange your target classes into a meaningful sequence, then you might be able to perform some kind of ordinal regression, which could be a better choice than a plain classifier; a sketch follows. However, classifying symbols in an alphabet does not work this way, because their ordering is arbitrary.
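One common ordinal trick (the Frank & Hall reduction, named here as an illustration rather than taken from the original answer) is to train K-1 binary classifiers for P(y > k) and combine them; a minimal sketch assuming integer labels 0..K-1 that really are ordered:

```python
# Ordinal classification via K-1 cumulative binary classifiers.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def fit_ordinal(X, y, base=None):
    base = base or LogisticRegression(max_iter=1000)
    thresholds = np.sort(np.unique(y))[:-1]
    # One binary model per threshold, each estimating P(y > k)
    return [clone(base).fit(X, (y > k).astype(int)) for k in thresholds]

def predict_ordinal(X, models):
    p_gt = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    # P(y = k) = P(y > k-1) - P(y > k), with P(y > -1) = 1 and P(y > K-1) = 0
    cum = np.hstack([np.ones((len(X), 1)), p_gt, np.zeros((len(X), 1))])
    return np.argmax(cum[:, :-1] - cum[:, 1:], axis=1)
```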

Finally, you might be facing such a large number of classes that a single classifier model is overwhelmed and you need to approach the problem differently.

For an example of this last case, consider a classifier for images of pets. If it had three classes (cats, dogs, rabbits), you would clearly use a standard classification approach. Even when classifying by breed - hundreds of classes - this approach still works well enough, as seen in ImageNet competitions. However, once you decide to try to identify each individual pet (still technically a class), you hit a problem using simple classifier techniques, and the structure of the solution needs more thought. One possible solution is a regression algorithm trained to extract biometric data from the image (nose length, distance between eyes, angle subtended between the centre of the jaw and the ears), moving the classification stage into a KNN lookup against a database of biometric data for observed individuals. This is how some face identification algorithms work: they first map images of faces into an easy-to-classify continuous space (typically using a deep CNN), then use a simpler classifier that scales well across that space, as in the sketch below.
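A sketch of that embed-then-classify pattern; `embed` below is a hypothetical placeholder for the biometric extractor (in a real system it would be a pretrained deep CNN), and the data is random:

```python
# Map images into a continuous feature space, then identify with nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def embed(images: np.ndarray) -> np.ndarray:
    # Placeholder for a CNN mapping each image to a point in "biometric" space
    return images.reshape(len(images), -1)[:, :32]

gallery_images = np.random.rand(200, 64, 64)       # photos of known individuals
gallery_ids = np.random.randint(0, 50, size=200)   # identity label per photo

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(embed(gallery_images), gallery_ids)

query = np.random.rand(1, 64, 64)
print("Predicted identity:", knn.predict(embed(query))[0])
```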

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange