Question

I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced.

My question is: are there any rules of thumb that tell us when we should subsample the larger category in order to force some kind of balance in the dataset?

Examples:

  • If there are 1,000 positive examples and 10,000 negative examples, should I train my classifier on the full dataset, or should I subsample the negative examples?
  • The same question for 1,000 positive examples and 100,000 negative.
  • The same question for 10,000 positive and 1,000 negative.
  • etc...

Solution

I think subsampling (downsampling) is a popular way to control class imbalance at the base level, in the sense that it fixes the root of the problem. So for all of your examples, randomly selecting 1,000 of the majority class each time would work. You could even build 10 models (10 folds of 1,000 majority samples against the 1,000 minority samples) so that your whole dataset gets used. This method works, but unless you try some such ensemble, you are effectively throwing away 9,000 samples. It is an easy fix, but it can make it hard to get an optimal model from your data.
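The fold idea above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn and NumPy; the function names and the choice of logistic regression as the base model are mine, not from the answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def downsample_ensemble(X, y, n_models=10):
    """Train n_models classifiers, each on all minority samples plus a
    freshly drawn majority subset of equal size, so every model is
    balanced but the majority data is reused across the ensemble."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_models):
        subset = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, subset])
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def predict_proba(models, X):
    # Average the positive-class probability across the ensemble.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Averaging the probabilities is one simple way to combine the folds; majority voting on the predicted labels would work too.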

The degree to which you need to control for the class imbalance depends largely on your goal. If you care about pure classification, the imbalance will affect the 50% probability cutoff for most techniques, so I would consider downsampling. If you only care about the order of the classifications (you want positives ranked generally higher than negatives) and use a measure such as AUC, the class imbalance will only bias your probabilities; the relative order should be reasonably stable for most techniques.
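To see this point concretely, here is a small synthetic sketch (my own construction, using scikit-learn): with a 10:1 imbalance, the fitted probabilities are pulled down so that almost nothing crosses the 0.5 cutoff, yet the AUC still reflects how well the scores rank positives above negatives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(1)
# 1,000 positives centered at 1.0, 10,000 negatives centered at 0.0.
X = np.r_[rng.normal(1.0, 1, (1000, 1)), rng.normal(0.0, 1, (10000, 1))]
y = np.r_[np.ones(1000), np.zeros(10000)].astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]

# Accuracy at the 0.5 cutoff is dominated by the majority class, while
# AUC measures only the ordering of the scores.
acc = accuracy_score(y, proba > 0.5)
auc = roc_auc_score(y, proba)
```

The model ends up predicting "negative" almost everywhere at the 0.5 threshold, so the accuracy looks high for the wrong reason, while the AUC stays at the level the class separation actually supports.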

Logistic regression handles class imbalance nicely: as long as you have more than roughly 500 examples of the minority class, the parameter estimates will be accurate enough, and the only impact will be on the intercept, which can be corrected for if that is something you want. Because logistic regression models probabilities rather than just class labels, you can make more manual adjustments to suit your needs.
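The intercept correction mentioned above can be sketched like this. The idea (a standard prior-correction argument; the variable names and data here are illustrative, not from the answer) is that if you kept each negative example with probability r, the fitted log-odds are inflated by ln(1/r), so you subtract that back:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Full population: 1,000 positives, 10,000 negatives.
X = np.r_[rng.normal(1.0, 1, (1000, 1)), rng.normal(0.0, 1, (10000, 1))]
y = np.r_[np.ones(1000), np.zeros(10000)].astype(int)

# Downsample the negatives to match the positives (keep ratio r = 0.1).
r = 0.1
neg = np.flatnonzero(y == 0)
keep = rng.choice(neg, size=int(len(neg) * r), replace=False)
idx = np.r_[np.flatnonzero(y == 1), keep]

model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# The balanced sample inflates the log-odds by ln(1/r);
# adding ln(r) to the intercept undoes the artificial 1:1 prior.
model.intercept_ = model.intercept_ + np.log(r)
corrected_proba = model.predict_proba(X)[:, 1]
```

After the correction, the average predicted probability comes back down toward the true base rate of the full dataset, which matters if you need calibrated probabilities rather than just a ranking.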

A lot of classification techniques also have a class-weight argument that will help you focus more on the minority class. It penalizes misclassification of a true minority example, so your overall accuracy will suffer a little, but you will start seeing more minority examples correctly classified.
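In scikit-learn, for example, this is the `class_weight` parameter; a quick sketch on the same kind of synthetic 10:1 data (the data and setup here are illustrative) shows the trade-off:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = np.r_[rng.normal(1.0, 1, (1000, 1)), rng.normal(0.0, 1, (10000, 1))]
y = np.r_[np.ones(1000), np.zeros(10000)].astype(int)

plain = LogisticRegression(max_iter=1000).fit(X, y)
# "balanced" reweights each class inversely to its frequency.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Recall on the minority (positive) class at the default 0.5 cutoff:
plain_recall = (plain.predict(X)[:1000] == 1).mean()
weighted_recall = (weighted.predict(X)[:1000] == 1).mean()
```

The weighted model catches far more true positives at the default cutoff, at the cost of more false positives and thus a somewhat lower overall accuracy, exactly the trade described above.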

Other tips

Imbalance is not formally defined, but a ratio of 1 to 10 is usually imbalanced enough to benefit from a balancing technique.

There are two types of imbalance, relative and absolute. In relative imbalance, the ratio between the majority and minority classes is skewed. In absolute imbalance, you also have a small absolute number of minority samples. The higher the imbalance ratio, the more likely you are to reach absolute imbalance as well.

Please note that straightforward subsampling is not an optimal way to cope with an imbalanced dataset, because ultimately you want a classifier that performs well on your original dataset. For a technique for building classifiers on imbalanced datasets, see here. For evaluating your classifier, see here.

Is there a data imbalance problem? In theory, it is only about numbers: even a difference of a single sample is, strictly speaking, imbalance.

In practice, whether to call it a data imbalance problem is governed by three things:

1. The number and distribution of samples you have
2. The variation within the same class
3. The similarity between different classes

The last two points change how we consider our problem.

To explain this, let me give an example: class A has 100 samples and class B has 10,000.

If the variation within class B is very low, then downsampling will be enough; there is no real data imbalance problem.

If the variation within class B is very high, then downsampling may lead to information loss, and it is risky to apply.

Another point: having a lot of samples (mainly of the minority class) will relax the data imbalance problem and make it easier to deal with.

E.g. 10:100 vs. 1,000:10,000.

Licensed under: CC-BY-SA with attribution