Question

I'm building a sentiment classification model using RandomForestClassifier. I got a training accuracy of 99.65 and a cross-validation accuracy (RepeatedStratifiedKFold, 5 folds) of 97.29, using the F1 score as the metric. The dataset has 5184 samples and is imbalanced, so I'm setting the class_weight hyper-parameter to 'balanced'. I have also done hyper-parameter tuning. These are the parameters I tuned:

estimator = RandomForestClassifier(random_state=42, class_weight='balanced', n_estimators=850, min_samples_split=4, max_depth=None, min_samples_leaf=1, max_features='sqrt')
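
Roughly, the evaluation looks like this, building on the estimator above (X and y are my own feature matrix and labels; the number of repeats is just an example):

    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    # X, y: my feature matrix and labels, built elsewhere from the text data
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

    # 'f1' assumes binary labels; for multi-class I'd use 'f1_macro' or 'f1_weighted'
    scores = cross_val_score(estimator, X, y, scoring='f1', cv=cv)
    print(scores.mean(), scores.std())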

I'm thinking the model is overfitting. I'm also wondering whether this issue is caused by the class imbalance.

Any immediate help on this is much appreciated.

Solution

There are quite a lot of features for the number of instances, so it's indeed likely that some overfitting is happening.

I'd suggest these options:

  • Force the decision trees to be less complex by setting the max_depth parameter to a low value, maybe around 3 or 4. Run the experiment with a range of values (e.g. from 3 to 10) and observe the changes in performance, preferably on a validation set, so that once the best parameter is found you can do the final evaluation on a separate test set (see the first sketch after this list).
  • Reduce the number of features: remove rare words (i.e. those which appear fewer than $N$ times) and/or use a feature selection method (see the second sketch after this list).
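
Here's a minimal sketch of the max_depth sweep, assuming X and y are your existing feature matrix and labels (the split size, repeat count and grid are only examples, and 'f1' assumes binary labels; use e.g. 'f1_macro' for multi-class):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                         train_test_split)

    # Hold out a test set that is used only for the final evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
    search = GridSearchCV(
        RandomForestClassifier(random_state=42, class_weight='balanced',
                               n_estimators=850, max_features='sqrt'),
        param_grid={'max_depth': list(range(3, 11))},
        scoring='f1', cv=cv)
    search.fit(X_train, y_train)

    print("best max_depth:", search.best_params_, "CV F1:", search.best_score_)
    # Final check on data the search never saw
    print("test F1:", f1_score(y_test, search.predict(X_test)))

If the cross-validation score drops only slightly while the gap to the training score shrinks, the shallower trees are a good trade.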
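
And a sketch of the feature-reduction idea, assuming the features come from a bag-of-words style vectorizer over raw texts (texts, y, and the min_df/k values are placeholders to adapt):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline

    # texts: list of raw documents, y: their labels (your own data)
    pipe = Pipeline([
        # min_df=5 drops words appearing in fewer than 5 documents (the rare-word filter)
        ('vect', TfidfVectorizer(min_df=5)),
        # keep only the k features most associated with the labels
        ('select', SelectKBest(chi2, k=1000)),
        ('clf', RandomForestClassifier(random_state=42, class_weight='balanced',
                                       n_estimators=850, max_depth=5,
                                       max_features='sqrt')),
    ])
    pipe.fit(texts, y)

Keeping the vectorizer and the selector inside the pipeline means they are re-fit on each training fold during cross-validation, so the feature selection cannot leak information from the validation folds.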