Using SMOTE for Synthetic Data generation to improve performance on unbalanced data

https://datascience.stackexchange.com/questions/47228

01-11-2019

Question

I have a dataset with 21392 samples, of which 16948 belong to the majority class (class A) and the remaining 4444 belong to the minority class (class B). I am using SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic data, but I am unsure what percentage of synthetic samples I should generate to get good classification performance from machine learning/deep learning models.

I have two options in mind:

1. Generate 21392 new samples, with 16904 majority samples of class A and the remaining 4488 minority samples of class B, then merge the original and the synthetically generated samples. The key drawback, I believe, is that the percentage of minority samples in the overall dataset (original + new) would stay more or less the same, which I think defeats the purpose of oversampling the minority class.
2. Generate 21392 new samples, with 16904 majority and the remaining 4488 minority samples, then merge the original data with only the newly generated minority samples. This way, the percentage of minority (class B) samples in the overall data would increase from 4444/21392 = 20.774 % to (4444+4488)/(21392+4488) = 34.513 %. This, I believe, is the purpose of SMOTE: to increase the number of minority samples and reduce the imbalance in the overall dataset.
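The class proportions behind the two options can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the counts quoted above; the function and variable names are made up for illustration.

```python
# Class counts taken from the question.
orig_major, orig_minor = 16948, 4444   # original data (class A, class B)
new_major, new_minor = 16904, 4488     # newly generated samples


def minority_share(major, minor):
    """Return the minority class as a percentage of all samples."""
    return 100.0 * minor / (major + minor)


original = minority_share(orig_major, orig_minor)
# Option 1: merge every newly generated sample with the original data.
option_1 = minority_share(orig_major + new_major, orig_minor + new_minor)
# Option 2: merge only the newly generated minority samples.
option_2 = minority_share(orig_major, orig_minor + new_minor)

print(f"original: {original:.3f} %")
print(f"option 1: {option_1:.3f} %")
print(f"option 2: {option_2:.3f} %")
```

Under option 1 the minority share barely moves from its original value, while option 2 reproduces the 34.513 % figure computed by hand.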

I am fairly new to SMOTE and would appreciate any suggestions or comments on which of these two options is better, or on any other option I should consider.
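For context, SMOTE as originally described creates new points for the minority class only, by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea in plain Python. It is a simplified illustration, not the implementation from any library, and the function name `smote_sketch`, the parameter defaults, and the toy points are all assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points, SMOTE-style (simplified).

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: hypothetical 2-D points, not the data from the question.
minority_b = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]
new_b = smote_sketch(minority_b, n_new=4, k=3)
print(new_b)
```

Because every synthetic point is a blend of two existing minority points, it always falls within the range already covered by the minority class; no majority-class samples are produced at any step.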

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange