Question

I fit my dataset with a random forest classifier and found that the model's performance varies across different train/test splits. From what I have observed, the AUC under the ROC curve jumps from 0.67 to 0.75 (for the same model with the same parameter settings), and the underlying range may be even wider than that. What is the issue behind this phenomenon, and how can I deal with it? As I understand it, cross-validation is applied to one specific split of train and test data.


Solution

A model will not produce the same performance when it is trained and evaluated on different parts of the dataset. Cross-validation helps account for this by rotating the training and validation sets and averaging the results over several folds.
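
As a minimal sketch of the effect you describe (assuming scikit-learn, with a synthetic dataset from make_classification standing in for your real data), the same model with the same hyperparameters can give noticeably different AUC values depending only on the random split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

aucs = []
for seed in range(10):
    # Same model, same hyperparameters -- only the split changes
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

print(f"AUC range over 10 splits: {min(aucs):.3f} to {max(aucs):.3f}")
```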

Given the large jump in AUC between validation sets, your model most likely has high variance: its performance depends strongly on which samples happen to land in the training and test sets, which is a common symptom of overfitting. You can picture an overfitted model like this:

[Figure: the green line represents the overfitted model.]

Common techniques to reduce overfitting in random forests are k-fold cross-validation, with k typically between 5 and 10, and growing a larger forest.
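
A minimal sketch of both ideas together (again assuming scikit-learn and a synthetic dataset; the forest size of 500 trees is only an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A larger forest, evaluated with 10-fold cross-validation on ROC AUC
clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

# Mean and standard deviation give a more stable picture than a single split
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```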

OTHER TIPS

What you are experiencing is not a problem, but rather an inherent attribute of all classifiers. The performance of a classifier depends on the training set, therefore the performance will vary with different training sets.

To find the best parameters for a specific classifier, you will therefore want to vary the training and test split (as in cross-validation) and choose the parameter set that achieves the best average accuracy or AUC.
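
One way to do this, sketched here with scikit-learn's GridSearchCV (the parameter grid values are purely illustrative, not recommendations for your data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Candidate hyperparameters -- illustrative values only
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2"],
}

# Each candidate is scored by its average cross-validated AUC
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```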

Finally, you will want to test the trained classifier on a separate evaluation set that was not part of the data used in cross-validation.
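
A sketch of that final step, under the same assumptions as the snippets above (synthetic data, illustrative parameter grid): the evaluation set is split off first and only touched once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out an evaluation set that never enters cross-validation
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 300, 500]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_dev, y_dev)

# Single, final measurement on the untouched evaluation set
eval_auc = roc_auc_score(y_eval, search.predict_proba(X_eval)[:, 1])
print(f"Held-out AUC: {eval_auc:.3f}")
```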

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange