Question

I'm training a RandomForestClassifier on a binary classification problem in scikit-learn. I want to maximize the model's AUC score. I understand this is not possible in the 0.13 stable release, but is possible in the 0.14 bleeding-edge version.

I tried this but I seemed to get a worse result:

ic = RandomForestClassifier(n_estimators=100, compute_importances=True, criterion='entropy', score_func=auc_score)

Does this work as a parameter for the model itself, or only in GridSearchCV?

If I use it in GridSearchCV, will it make the model fit the data better with respect to auc_score? I also want to try maximizing recall_score.


Solution

I am surprised the above does not raise an error. You can use AUC only for model selection, as in GridSearchCV. If you use it there (scoring='roc_auc', if I remember correctly), the model with the best cross-validated AUC over the parameter grid will be selected. It does not make the individual models fit any better with respect to this score. It is still worth trying, though.
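A minimal sketch of what that selection looks like. This assumes a current scikit-learn, where GridSearchCV lives in sklearn.model_selection (in the 0.14 era it was sklearn.grid_search); the parameter grid and the toy data are made up for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy imbalanced binary problem, for illustration only
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}

# scoring='roc_auc' picks the candidate with the best cross-validated AUC;
# swap in scoring='recall' to select by recall_score instead
search = GridSearchCV(RandomForestClassifier(criterion='entropy', random_state=0),
                      param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

Note that each forest in the grid is still trained with its usual split criterion; the scoring choice only decides which trained candidate wins.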

OTHER TIPS

I have found a journal article that addresses highly imbalanced classes with random forests. Although it is aimed at running random forests on Hadoop clusters, the same techniques seem to work well on smaller problems, too:

del Río, S., López, V., Benítez, J. M., & Herrera, F. (2014). On the use of MapReduce for imbalanced big data using Random Forest. Information Sciences, 285, 112-137.

http://sci2s.ugr.es/rf_big_imb/pdf/rio14_INS.pdf
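As a related, standard option (my own addition, not a technique taken from the paper): scikit-learn's random forest supports cost-sensitive learning through the class_weight parameter, which was added in versions newer than the 0.14 discussed above. A minimal sketch on made-up imbalanced data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data with a 95/5 class imbalance, for illustration only
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights classes inversely to their frequency,
# which often helps AUC and recall on the minority class
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=0)
clf.fit(X_train, y_train)
print('test AUC:', roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))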

Licensed under: CC-BY-SA with attribution