Question

I need to classify a relatively small time series dataset.

The training set is 5087 rows (instances to classify) by 3197 columns (time samples), which are (or should be, as far as I understand) the features of the model. I don't know yet whether every sample is important; I will think about downsampling, filtering, or a Fourier transform later.

Unfortunately, the dataset is extremely unbalanced: only 37 (0.7%) of the 5087 rows are "Positive". How would you approach this? I have to use the scikit-learn library.

Since this is my first time with scikit-learn, I wanted to try a very simple classifier with few hyperparameters and build up from there.

My plan:

1. Choose the classifier: logistic regression, because it is the simplest I can think of and this is just a test.
2. Choose the regularization parameter via a tuning grid.
3. Choose the cross-validation splitting strategy: I wanted to use a stratified bootstrap, but it is not provided by the library, so I opted for StratifiedShuffleSplit.
4. Choose the metric: Cohen's kappa, because the dataset is so unbalanced that accuracy would be far too biased.

Script:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import cohen_kappa_score, make_scorer

classifier = LogisticRegression(tol=1e-4, max_iter=500, random_state=1)
param_grid = {'C': list(range(3))}  # range(3) gives C in {0, 1, 2}; C=0 is invalid
splitter = StratifiedShuffleSplit(n_splits=5, random_state=1)
grid_searcher = GridSearchCV(classifier, param_grid, cv=splitter, scoring=make_scorer(cohen_kappa_score))
model = grid_searcher.fit(train_x, train_y)

First is "cv=splitter" legit? Second, what do you think of this approach? Obviously with such a trivial classifier the model predicted all Negative and I also got some warnings:

FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan
ZeroDivisionError: float division by zero


Solution

Your dataset is extremely unbalanced, and most models would simply ignore those 37 samples. After all, getting only 0.7% of any test wrong looks like an extremely good result!

There are several ways to address an imbalanced dataset. I suggest two options: (1) assign a very high penalty to the misclassification of positive samples, i.e., weight the loss function by class (in scikit-learn, the class_weight parameter), or (2) resample, so that when you draw a random row, a positive sample is more likely to be drawn than a negative one.
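For option (1), here is a minimal sketch of how that could look with the setup from your question. class_weight='balanced' is scikit-learn's built-in inverse-frequency weighting; the grid also restricts C to strictly positive values, since the C=0 produced by range(3) is the likely cause of the ZeroDivisionError you saw:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import cohen_kappa_score, make_scorer

# 'balanced' weights each class inversely to its frequency: with 37
# positives out of 5087 rows, each positive counts roughly 137 times
# as much as a negative in the loss.
classifier = LogisticRegression(class_weight='balanced', tol=1e-4,
                                max_iter=500, random_state=1)
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}  # strictly positive values only
splitter = StratifiedShuffleSplit(n_splits=5, random_state=1)
grid_searcher = GridSearchCV(classifier, param_grid, cv=splitter,
                             scoring=make_scorer(cohen_kappa_score))
model = grid_searcher.fit(train_x, train_y)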

See, for example, https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets for an implementation, and the related question "How to deal with class imbalance in a neural network?".
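For option (2), since you are restricted to scikit-learn (rather than imbalanced-learn), a minimal oversampling sketch with sklearn.utils.resample could look as follows; train_x/train_y and a positive label of 1 are assumptions carried over from your question:

import numpy as np
from sklearn.utils import resample

# Split the training data by class (assumes NumPy arrays and positives
# labeled 1).
pos_mask = (train_y == 1)
x_pos, y_pos = train_x[pos_mask], train_y[pos_mask]
x_neg, y_neg = train_x[~pos_mask], train_y[~pos_mask]

# Oversample the 37 positives with replacement until they match the
# number of negatives.
x_pos_up, y_pos_up = resample(x_pos, y_pos, replace=True,
                              n_samples=len(y_neg), random_state=1)

train_x_bal = np.vstack([x_neg, x_pos_up])
train_y_bal = np.concatenate([y_neg, y_pos_up])

If you combine this with cross-validation, resample inside each training fold only (for example via imbalanced-learn's pipeline); otherwise duplicated positives leak into the validation splits and inflate the scores.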

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange