How to best handle imbalanced text classification with Keras?

https://datascience.stackexchange.com/questions/73962

11-12-2020

Question

I implemented a text classification model using Keras.

Most of the datasets that I use are imbalanced. Therefore, I would like to use SMOTE to handle said imbalance.

I tried both on plain text, and once the text was vectorized, but I don't seem to be able to apply SMOTE on text data.

I use imblearn and received the following error:

Expected n_neighbors <= n_samples,  but n_samples = 3, n_neighbors = 6

How can I fix this error? And is SMOTE a good idea? If not, what other ways could I deal with class imbalance?

Solution

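First of all, to reassure you: SMOTE can work on text data, but only once the text has been turned into numeric vectors (for example bag-of-words, TF-IDF, or embeddings). It creates synthetic samples by interpolating between a minority-class point and its nearest neighbors, so it needs a feature space in which distances and interpolation make sense. It cannot be applied to raw strings.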
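To make the distance requirement concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE: a synthetic sample is a random point on the line segment between a minority-class sample and one of its nearest neighbors. The function name `smote_point` is hypothetical and this is not imblearn's actual code, only an illustration of why the inputs have to be numeric vectors.

```python
import random


def smote_point(x, neighbor, rng=random):
    """Return a synthetic point on the segment between x and neighbor.

    x and neighbor are equal-length lists of floats, e.g. two TF-IDF
    vectors from the same minority class.
    """
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]
```

Raw strings have no meaningful "point in between", which is why the text has to be vectorized first.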

Based on the error message you receive, this looks like a problem with how SMOTE is being applied to your data rather than with SMOTE itself (sharing part of your code and the number of samples per class would greatly help).

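As the error states, SMOTE asked for 6 neighbors but only 3 samples were available. In imblearn, the neighbor search runs within each class being oversampled, and SMOTE's default k_neighbors=5 means it looks for 6 points (the sample itself plus 5 neighbors). So the most likely cause is that one of your minority classes contains only 3 samples. If you expected more, check how the data is loaded and split. Otherwise, set k_neighbors to at most the size of the smallest class minus one, or collect more data for that class.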
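One way to avoid the error is to cap k_neighbors by the size of the smallest class before calling SMOTE. The sketch below assumes the labels are available as a plain list and that the text is vectorized with TF-IDF. The helper names `safe_k_neighbors` and `oversample` are hypothetical. `SMOTE(k_neighbors=...)` and `fit_resample` are existing imblearn calls.

```python
from collections import Counter


def safe_k_neighbors(labels, default=5):
    """Largest k_neighbors SMOTE can use given the smallest class size.

    SMOTE searches for k_neighbors + 1 points (the sample itself plus its
    neighbors) inside each class it oversamples, so the smallest class
    needs at least k_neighbors + 1 samples.
    """
    smallest = min(Counter(labels).values())
    if smallest < 2:
        raise ValueError("SMOTE needs at least 2 samples in every class")
    return min(default, smallest - 1)


def oversample(texts, labels):
    """Vectorize texts with TF-IDF, then oversample with SMOTE."""
    # Imported lazily so the helper above works without these packages.
    from imblearn.over_sampling import SMOTE
    from sklearn.feature_extraction.text import TfidfVectorizer

    X = TfidfVectorizer().fit_transform(texts)  # sparse numeric matrix
    smote = SMOTE(k_neighbors=safe_k_neighbors(labels), random_state=0)
    return smote.fit_resample(X, labels)
```

With a minority class of 3 samples this yields k_neighbors=2, which satisfies the check in the error message. Oversampling is normally applied to the training split only, so that synthetic points do not leak into validation or test data.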

Licensed under: CC-BY-SA with attribution
