Question

I have a dataset of about 4.7K samples for a binary classification problem. The class proportion is 33:67, i.e. label 1 has 1558 samples (33%) and label 0 has 3154 samples (67%).

Is my dataset imbalanced? Some people say it is not that bad.

My objective is to maximise the F1-score. I set class_weight='balanced' in my parameters and scoring='f1' during cross-validation, as shown below.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm = SVC(random_state=42)
svm_cv = GridSearchCV(svm, param_grid, cv=5, scoring='f1')
svm_cv.fit(X_train_std, y_train)

Can you show me, through a code sample, how to increase the weight of the minority class, and whether that is any different from choosing the 'balanced' parameter?

Currently my results are as follows:

[Image: cross-validated metrics per algorithm]

I understand that the AUC for a few algorithms is above 0.80, but I believe the F1-score is more important for an imbalanced-class problem like mine.

Can you help? I tried oversampling the minority class, but it did not improve things much.

Increasing the number of features doesn't get me to an 80% F1-score either.


Solution

I would say your data is not imbalanced; 33:67 is not a bad ratio. Still, you could try undersampling the majority class. As another option, try a different algorithm such as random forest (see the sketch below). You can also try boosting.
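As a rough sketch of the random-forest suggestion (assuming the X_train_std and y_train arrays from the question; the hyperparameter values are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# class_weight='balanced' upweights the minority class automatically
rf = RandomForestClassifier(n_estimators=300, class_weight='balanced', random_state=42)
f1_scores = cross_val_score(rf, X_train_std, y_train, cv=5, scoring='f1')
print(f1_scores.mean())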

OTHER TIPS

I would call a dataset imbalanced when the ratio is at least 90:10. Your problem is not imbalanced.

The F1 score is not a loss function but a metric. In your GridSearchCV, the SVM is still minimising its own loss function; the F1 metric is only used to select the best hyperparameters across your folds. It is important to understand this distinction.
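If you want to increase the weight of the minority class explicitly (rather than relying on 'balanced'), you can put class_weight into the param_grid itself and let F1 pick the best setting. A minimal sketch, assuming the same SVC, X_train_std and y_train as in the question; the grid values are only examples:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# 'balanced' versus explicit weights that favour the minority class (label 1)
param_grid = {
    'C': [0.1, 1, 10],
    'class_weight': ['balanced', {0: 1, 1: 2}, {0: 1, 1: 3}],
}
svm = SVC(random_state=42)
svm_cv = GridSearchCV(svm, param_grid, cv=5, scoring='f1')
svm_cv.fit(X_train_std, y_train)
print(svm_cv.best_params_, svm_cv.best_score_)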

If you want to apply oversampling/undersampling techniques, you can use the following library (even though, in my view, you don't really need it here):

https://pypi.org/project/imbalanced-learn/
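For example, random over- and undersampling with imbalanced-learn could look roughly like this (a sketch assuming the training arrays from the question; only the training split should be resampled, never the test set):

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# duplicate minority-class samples until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_over, y_over = ros.fit_resample(X_train_std, y_train)

# drop majority-class samples until the classes are balanced
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train_std, y_train)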

If you want to improve your score, you could also try another algorithm such as gradient boosting, with its different implementations: XGBoost, LightGBM, CatBoost, ...
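As an illustration, a LightGBM classifier with the same class-weighting idea (a sketch, not a tuned model; it assumes the question's training arrays):

from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# class_weight='balanced' plays the same role here as in SVC or RandomForest
lgbm = LGBMClassifier(class_weight='balanced', random_state=42)
print(cross_val_score(lgbm, X_train_std, y_train, cv=5, scoring='f1').mean())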

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange