Improving classifier performances in R for imbalanced dataset

https://datascience.stackexchange.com/questions/18295

22-10-2019
|

Question

I have used an "adabag"(boosting + bagging) model on an imbalanced dataset (6% positive), I have tried to maximized the sensitivity while keeping the accuracy above 70% and the best results I got were:

ROC= 0.711
SENS=0.94
SPEC=0.21

The results aren't Inhofe, especially the bad specificity. Any suggestion on how to improve the result? Can the optimization be improved, or would the addition of a penalty term help?

This is the code:

ctrl <- trainControl(method = "cv",
                     number = 5,
                     repeats = 2, 
                     p = 0.80,
                     search = "grid", 
                     initialWindow = NULL, 
                     horizon = 1,
                     fixedWindow = TRUE,
                     skip = 0,
                     verboseIter = FALSE,
                     returnData = TRUE,
                     returnResamp = "final",
                     savePredictions = "all",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     preProcOptions = list(thresh = 0.80, ICAcomp = 3, k = 7, freqCut = 90/10,uniqueCut = 10, cutoff = 0.2),
                     sampling = "smote",
                     selectionFunction = "best",
                     index = NULL,
                     indexOut = NULL,
                     indexFinal = NULL,
                     timingSamps = 0,
                     predictionBounds = rep(FALSE, 2),
                     seeds = NA,
                     adaptive = list(min = 5,alpha = 0.05, method = "gls", complete = TRUE),
                     trim = FALSE,
                     allowParallel = TRUE)

grid <- expand.grid(maxdepth = 25, mfinal = 4000)

classifier <- train(x = training_set[,-1],y = training_set[,1], method = 'AdaBag',trControl = ctrl,metric = "ROC",tuneGrid = grid)
prediction <- predict(classifier, newdata= test_set,'prob')

plot from classifierplots package:

I tried xgboost as well.

Here is the code:

gbmGrid <- expand.grid(nrounds = 50, eta = 0.3,max_depth = 3,gamma = 0,colsample_bytree=0.6,min_child_weight=1,subsample=0.75)

ctrl <- trainControl(method = "cv",
                     number = 10,
                     repeats = 2, 
                     p = 0.80,
                     search = "grid", 
                     initialWindow = NULL, 
                     horizon = 1,
                     fixedWindow = TRUE,
                     skip = 0,
                     verboseIter = FALSE,
                     returnData = TRUE,
                     returnResamp = "final",
                     savePredictions = "all",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "smote",
                     selectionFunction = "best",
                     index = NULL,
                     indexOut = NULL,
                     indexFinal = NULL,
                     timingSamps = 0,
                     predictionBounds = rep(FALSE, 2),
                     seeds = NA,
                     adaptive = list(min = 5,alpha = 0.05, method = "gls", complete = TRUE),
                     trim = FALSE,
                     allowParallel = TRUE)

classifier <- train(x = training_set[,-1],y = training_set[,1], method = 'xgbTree',metric = "ROC",trControl = ctrl,tuneGrid = gbmGrid)
prediction <- predict(classifier, newdata= test_set[,-1],'prob')

plot from classifierplots package:

Update:

I tried asymmetric adaboost, this is the code:

model_weights <- ifelse(training_set$readmmited == "yes",
                        (1/table(training_set$readmmited)[1]) * 0.4,
                        (1/table(training_set$readmmited)[2]) * 0.6)


ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 2, 
                     search = "grid", 
                     returnData = TRUE,
                     returnResamp = "final",
                     savePredictions = "all",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     selectionFunction = "best",
                     allowParallel = TRUE)

classifier <- train(x = training_set[,-1],y = training_set[,1], method = 'ada',trControl = ctrl,metric = "ROC",weights = model_weights)

but the specificity is zero, what am I doing wrong?

Solution

You should try compensating for the imbalanced data and then can you try a lot of different classifiers. Either balance it out, use SMOTE to interpolate (this always struck me as too magical), or assign weights.

Here's a nice article walking through it with caret, which is what it appears you're using:

http://dpmartin42.github.io/blogposts/r/imbalanced-classes-part-1

OTHER TIPS

SMOTE is a good strategy and also I have got significant accuracy, ROC with cost-sensitive classification. In life science, we handle a lot of imbalance datasets this paper describes approach how to handle it. https://jcheminf.springeropen.com/articles/10.1186/1758-2946-1-21

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange