Poor performance shown on Rare event modeling

https://datascience.stackexchange.com/questions/6042

16-10-2019

Question

I am working on a rare event classification problem. 95% of the data belongs to the majority class and 5% to the minority class. I use a classification tree algorithm, and I measure the goodness of the model using a confusion matrix.

Because the minority class is just 5% of the total data, the total number of errors is high even though my prediction performance on the minority class is close to 70%.

For example, here is my confusion matrix:

          0       1
0    213812    7008
1     29083   16877

Though the minority class (class 1) is predicted correctly 16877 times (70%) and the misclassification rate is just 30%, the absolute number of misclassifications (29083) is very high compared to the correctly predicted minority class (16877). This makes the solution less usable for the business.
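The figures above can be checked directly from the matrix. The sketch below assumes that rows are predicted classes and columns are actual classes; the question does not label the axes, but this is the reading under which the 70% figure comes out. Under that assumption, recall for class 1 is about 0.71 while precision is only about 0.37, which is the gap the question describes. The variable names are illustrative only.

```r
# Confusion matrix from the question.
# Assumption: rows = predicted class, columns = actual class.
cm <- matrix(c(213812,  7008,
                29083, 16877),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("0", "1"),
                             actual    = c("0", "1")))

tp <- cm["1", "1"]  # minority cases predicted as minority
fn <- cm["0", "1"]  # minority cases predicted as majority
fp <- cm["1", "0"]  # majority cases predicted as minority

recall    <- tp / (tp + fn)  # share of actual minority cases that were found
precision <- tp / (tp + fp)  # share of minority predictions that were right

print(round(c(recall = recall, precision = precision), 3))
```

Seen this way, the 29083 errors lower precision rather than recall, so it helps to report both numbers instead of a single accuracy-style percentage.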

Are there any ideas for handling this kind of issue in rare event modelling?

Kind note: I balanced the target variable using the SMOTE algorithm before applying the classification tree.

Solution

If you are willing to use the caret package in R and use random forests, you can use the method in the following blog post for downsampling with unbalanced datasets: http://appliedpredictivemodeling.com/blog/2013/12/8/28rmc2lv96h8fw8700zm4nl50busep

Basically, you just add two arguments, strata and sampsize, to your train call. Here is the relevant part:

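library(caret)

## Assumed setup (not shown in this answer): metric = "ROC" needs
## class probabilities and twoClassSummary; nmin is the size of
## the smallest class.
ctrl <- trainControl(method = "cv",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
nmin <- min(table(training$Class))

rfDownsampled <- train(Class ~ ., data = training,
                       method = "rf",
                       ntree = 1500,
                       tuneLength = 5,
                       metric = "ROC",
                       trControl = ctrl,
                       ## Tell randomForest to sample by strata. Here,
                       ## that means within each class
                       strata = training$Class,
                       ## Now specify that the number of samples selected
                       ## within each class should be the same
                       sampsize = rep(nmin, 2))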
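To make concrete what strata and sampsize ask for, here is a minimal base R sketch of drawing the same number of rows from each class. The toy data frame and its column names are hypothetical. It is also a simplification: the sketch draws one sample without replacement, whereas randomForest draws a fresh stratified sample for every tree (with replacement by default).

```r
set.seed(1)

# Hypothetical toy data: 95 majority rows and 5 minority rows.
toy <- data.frame(
  x     = rnorm(100),
  Class = factor(rep(c("majority", "minority"), times = c(95, 5)))
)

# Size of the smallest class, as in sampsize = rep(nmin, 2).
nmin <- min(table(toy$Class))

# Within each class (stratum), pick nmin row indices at random.
rows_by_class <- split(seq_len(nrow(toy)), toy$Class)
idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), nmin)]
}))

balanced <- toy[idx, ]
print(table(balanced$Class))
```

Each tree is then fit on a sample in which both classes are equally represented, without discarding majority rows from the data set as a whole.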

I have had some success with this approach in your type of situation.

For some more context, here is an in-depth post about experiments with unbalanced datasets: http://www.win-vector.com/blog/2015/02/does-balancing-classes-improve-classifier-performance/

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange