Question

I have an imbalanced data set consisting of some tens of millions of text strings, each with thousands of features created from uni- and bigrams; additionally, I have the string length and the entropy of each string as features.

It is a multiclass data set (40-50 classes), but it is imbalanced: some classes can be 1000x smaller than the largest class. I have capped the data at 1 million strings per class; otherwise the imbalance would be even larger.

Because of this I want to use over-sampling to improve the data for the underrepresented classes. I have looked into ADASYN and SMOTE from the Python imblearn package, but when I run either one, the process eats up all my RAM and swap, and soon afterwards it gets killed. I assume the memory is simply not enough.

My question is now how best to proceed. Obviously my data is too large to be over-sampled as is. I have thought of two options, but I cannot tell which is the more "correct" one:

  • I send in only one underrepresented class together with the largest class, and repeat this for each underrepresented class. I am not sure whether this could cause the classes to start overlapping, though.

  • I instead under-sample the data, perhaps down to 100k samples per class. This might reduce the data enough that I can run over-sampling on the less represented classes (which have 1k-10k samples).

Are there any other, more appropriate options that I have missed?

No correct solution

Licensed under: CC-BY-SA with attribution