Question

I have a dataset with the following specifications:

  • Training set: 193,176 samples (2,821 positives)
  • Test set: 82,887 samples (673 positives)
  • 10 features

I want to perform binary classification (0 or 1). The issue I am facing is that the data is highly imbalanced. After normalizing and scaling the data, doing some feature engineering, and trying a couple of different algorithms, these are the best results I could achieve:

Mean squared error: 0.00804710026904
Confusion matrix:
[[82214   667]
 [    0     6]]

i.e. only 6 correct positive hits. This is using logistic regression. Here are the various things I tried:

  • Different algorithms like RandomForest, DecisionTree, SVM
  • Changing the parameter values passed to the functions
  • Some intuition based feature engineering to include compounded features

Now, my questions are:

  1. What can I do to improve the number of positive hits?
  2. How can one determine whether the model is overfitting in such a case? (I have tried plotting, etc.)
  3. At what point could one conclude that this may be the best possible fit? (which seems sad, considering only 6 hits out of 673)
  4. Is there a way to make the positive samples weigh more, so that pattern recognition improves and leads to more hits?
  5. Which graphical plots could help detect outliers or give some intuition about which pattern would fit best?

I am using the scikit-learn library with Python and all implementations are library functions.

edit:

Here are the results with a few other algorithms:

RandomForestClassifier(n_estimators=100):

[[82211   667]
 [    3     6]]

Decision Trees:

[[78611   635]
 [ 3603    38]]

Solution

  1. Since you are doing binary classification, have you tried adjusting the classification threshold? Since your algorithm seems rather insensitive, I would try lowering it and check whether there is an improvement.

  2. You can always use learning curves, or a plot of a model parameter vs. training and validation error, to determine whether your model is overfitting. It seems it is underfitting in your case, but that's just intuition.

  3. Well, ultimately it depends on your dataset and the different models you have tried. At this point, and without further testing, there cannot be a definite answer.

  4. Without claiming to be an expert on the topic, there are a number of techniques you can follow (hint: first link on Google), but in my opinion you should first make sure you choose your cost function carefully, so that it represents what you are actually looking for.

  5. Not sure what you mean by pattern intuition, can you elaborate?
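Points 1 and 4 can be sketched together in scikit-learn: weight the positive class via `class_weight='balanced'` and then sweep the decision threshold on the predicted probabilities instead of using the default 0.5 cutoff. This is only a minimal illustration, not the asker's actual pipeline; `make_classification` stands in for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic imbalanced data standing in for the real dataset (~1.5% positives)
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.985], random_state=0)
X_train, y_train = X[:4000], y[:4000]
X_test, y_test = X[4000:], y[4000:]

# class_weight='balanced' penalizes mistakes on the rare class more heavily
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)

# Probability of the positive class for each test sample
proba = clf.predict_proba(X_test)[:, 1]

# Lowering the threshold trades false positives for more true positives
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    print(f"threshold={threshold}: TP={tp}, FP={fp}, FN={fn}")
```

The right threshold depends on the relative cost of false positives vs. false negatives, which is exactly the cost-function question in point 4.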

By the way, what were your results with the different algorithms you tried? Were they any different?
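The learning-curve check from point 2 is built into scikit-learn. A minimal sketch (again on synthetic stand-in data): if both curves plateau at a poor score, the model is underfitting; a large, persistent gap between them suggests overfitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic imbalanced data standing in for the real dataset
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95], random_state=0)

# F1 is a more informative score than accuracy on skewed data
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring='f1')

# Mean score per training-set size; plot these two curves against `sizes`
print("train:", train_scores.mean(axis=1))
print("valid:", val_scores.mean(axis=1))
```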

OTHER TIPS

Since the data is very skewed, we can also try training the model after over-sampling the data.

SMOTE and ADASYN are two techniques we can use to over-sample the minority class.
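As a minimal sketch of the over-sampling idea using only scikit-learn, one can randomly duplicate minority samples until the classes are balanced. SMOTE and ADASYN themselves live in the separate imbalanced-learn package (`imblearn.over_sampling.SMOTE` / `ADASYN`, with a `fit_resample` API) and synthesize new minority points rather than duplicating existing ones; the data below is synthetic.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
y = np.array([0] * 980 + [1] * 20)  # 2% positives, mimicking the skew

X_neg, X_pos = X[y == 0], X[y == 1]

# Random over-sampling: draw minority samples with replacement
# until they match the majority class in size
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg),
                    random_state=0)
X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.array([0] * len(X_neg) + [1] * len(X_pos_up))

print(np.bincount(y_bal))  # classes are now equal in size
```

Note that over-sampling should be applied only to the training split, never to the test set, or the evaluation becomes meaningless.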

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange