Question

I have a problem statement where the percentage of goods (denoted by 0) is 95% and the percentage of bads (denoted by 1) is only 5%. One option is to undersample the goods so that the model learns the patterns of both segments properly, but undersampling leads to a large loss of data, which will directly lower my model's performance. Hence I have opted for oversampling the bads, but oversampling has its own problems too:

Check this code snippet:

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 33)

x = train_data.drop(['target'], axis = 1)
y = train_data['target']              # keep y as a 1-d Series

x_new, y_new = sm.fit_resample(x, y)  # fit_resample is the current API; fit_sample was removed from imbalanced-learn

y.value_counts()      # 0 -> 26454   1 -> 2499
y_new.value_counts()  # 0 -> 26454   1 -> 26454

After oversampling I get an equal number of goods and bads, but the problem is that the variable distributions are affected.

For example, I have an 'age' variable. For the goods, the bucket-wise distribution is:

 1 - 25 years - 20%
26 - 50 years - 35%
50+     years - 45%

and the distribution for the bads is (before oversampling):

 1 - 25 years - 50%
26 - 50 years - 30%
50+     years - 20%

But after oversampling, the distribution of the bads changes:

 1 - 25 years - 40%
26 - 50 years - 35%
50+     years - 25%

So now this variable's distribution is not as discriminative between good and bad as it was earlier (before oversampling).


Is there any way to oversample without affecting my variable distributions?

Was it helpful?

Solution

Class imbalance is a frequent problem in machine learning, and techniques to balance the data usually come in two flavors: undersampling the majority class and oversampling the minority class (or a combination of both).

One can always partition the data according to some variables and oversample each partition separately, so as to maintain some measure (e.g. the given data distribution). Just as separate oversampling can be done for a single variable, it can be done for $n$ variables; this is more complex but certainly doable. For example, one takes all distinct combinations of variables (or ranges of values, for continuous variables) and oversamples each such cluster separately in order to maintain the given data distribution.
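Below is a minimal sketch of that idea for the single 'age' variable from the question: the bads are split into the same age buckets and each bucket is oversampled by the same factor (with replacement, i.e. plain duplication rather than SMOTE's interpolation), so the bucket-wise distribution of the bads is preserved exactly. The bucket edges and the helper function name are assumptions for illustration only.

import pandas as pd
from sklearn.utils import resample

def oversample_bads_per_bucket(train_data, factor=10, random_state=33):
    # Hypothetical helper: duplicate the bads bucket by bucket with the same
    # factor, so the bucket proportions of the bads stay unchanged.
    goods = train_data[train_data['target'] == 0]
    bads = train_data[train_data['target'] == 1]

    # Assumed bucket edges matching the question: 1-25, 26-50, 50+
    buckets = pd.cut(bads['age'], bins=[0, 25, 50, 200],
                     labels=['1-25', '26-50', '50+'])

    oversampled = [
        resample(group, replace=True,
                 n_samples=len(group) * factor,   # same multiple for every bucket
                 random_state=random_state)
        for _, group in bads.groupby(buckets) if len(group) > 0
    ]
    return pd.concat([goods] + oversampled, ignore_index=True)

With factor = 10 the roughly 2,499 bads become about 25,000, close to the 26,454 goods, while the 50/30/20 split across the age buckets is kept intact.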

The above is a straightforward technique, although one should note that if the minority class does not have enough samples, there is no guarantee that the given data distribution reflects the (true) underlying data distribution (in other words, it may not constitute a representative sample in the statistical sense). For such cases, oversampling the whole data, without extra assumptions about the underlying distribution, is a maximally unbiased method in the statistical sense.

There has been some research lately on hybrid and intelligent methods for oversampling in class-imbalance problems without introducing bias in the process. The following references provide the relevant background:

Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches, October 2018

Although cross-validation is a standard procedure for performance evaluation, its joint application with oversampling remains an open question for researchers farther from the imbalanced data topic. A frequent experimental flaw is the application of oversampling algorithms to the entire dataset, resulting in biased models and overly-optimistic estimates. We emphasize and distinguish overoptimism from overfitting, showing that the former is associated with the cross-validation procedure, while the latter is influenced by the chosen oversampling algorithm. Furthermore, we perform a thorough empirical comparison of well-established oversampling algorithms, supported by a data complexity analysis. The best oversampling techniques seem to possess three key characteristics: use of cleaning procedures, cluster-based example synthetization and adaptive weighting of minority examples, where Synthetic Minority Oversampling Technique coupled with Tomek Links and Majority Weighted Minority Oversampling Technique stand out, being capable of increasing the discriminative power of data
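To make the flaw described above concrete, here is a sketch of the safe pattern with the x and y from the question: the oversampler goes inside an imblearn Pipeline so that SMOTE is fitted only on the training part of each cross-validation fold, never on the validation folds. The classifier and scoring metric are arbitrary choices for the example.

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# SMOTE runs only on the training portion of each fold; validation folds stay untouched
pipeline = Pipeline([
    ('smote', SMOTE(random_state=33)),
    ('clf', LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)
scores = cross_val_score(pipeline, x, y, scoring='roc_auc', cv=cv)
print(scores.mean())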

Learning from Imbalanced Data, September 2009

With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of under represented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem, 14 February 2020

The class imbalance problem has been a hot topic in the machine learning community in recent years. Nowadays, in the time of big data and deep learning, this problem remains in force. Much work has been performed to deal to the class imbalance problem, the random sampling methods (over and under sampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and also they have been combined with cleaning techniques such as Editing Nearest Neighbor or Tomek’s Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, it is noticeable that the class imbalance problem has been addressed by adaptation of traditional techniques, relatively ignoring intelligent approaches. Thus, the capabilities and possibilities of heuristic sampling methods on deep learning neural networks in big data domain are analyzed in this work, and the cleaning strategies are particularly analyzed. This study is developed on big data, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. The effectiveness of a hybrid approach on these datasets is analyzed, in which the dataset is cleaned by SMOTE followed by the training of an Artificial Neural Network (ANN) with those data, while the neural network output noise is processed with ENN to eliminate output noise; after that, the ANN is trained again with the resultant dataset. Obtained results suggest that best classification outcome is achieved when the cleaning strategies are applied on an ANN output instead of input feature space only. Consequently, the need to consider the classifier’s nature when the classical class imbalance approaches are adapted in deep learning and big data scenarios is clear.
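For reference, the SMOTE-plus-cleaning combinations mentioned in these papers (SMOTE + Tomek Links and SMOTE + Edited Nearest Neighbours) are available directly in imblearn. A minimal sketch using the x and y from the question, with no claim about which works best for this particular dataset:

from imblearn.combine import SMOTETomek, SMOTEENN

x_tl, y_tl = SMOTETomek(random_state=33).fit_resample(x, y)     # SMOTE followed by Tomek Links cleaning
x_enn, y_enn = SMOTEENN(random_state=33).fit_resample(x, y)     # SMOTE followed by Edited Nearest Neighbours

print(y_tl.value_counts())
print(y_enn.value_counts())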

Hope these notes help.

Licensed under: CC-BY-SA with attribution