Question

I'm building a sentiment classification model using RandomForestClassifier. I got a training accuracy of 99.65 and a cross-validation accuracy (RepeatedStratifiedKFold, 5 folds) of 97.29, using the F1 score as the metric. The dataset has 5184 samples and is imbalanced, so I'm setting the class_weight hyper-parameter to 'balanced'. I have also done hyper-parameter tuning. These are the parameters I tuned:

estimator = RandomForestClassifier(random_state=42, class_weight='balanced', n_estimators=850, min_samples_split=4, max_depth=None, min_samples_leaf=1, max_features='sqrt')
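
Roughly, the evaluation looks like this, building on the estimator above (X and y are my own feature matrix and labels; the number of repeats is just an example):

    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    # X, y: my feature matrix and labels, built elsewhere from the text data
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

    # 'f1' assumes binary labels; for multi-class I'd use 'f1_macro' or 'f1_weighted'
    scores = cross_val_score(estimator, X, y, scoring='f1', cv=cv)
    print(scores.mean(), scores.std())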

I'm thinking the model is overfitting. I'm also wondering whether this issue is caused by the class imbalance.

Any immediate help on this is much appreciated.

Solution

There are quite a lot of features for the number of instances, so it's indeed likely that some overfitting is happening.

I'd suggest these options:

  • Force the decision trees to be less complex by setting the max_depth parameter to a low value, maybe around 3 or 4. Run the experiment with a range of values (e.g. from 3 to 10) and observe the changes in performance, preferably on a validation set, so that once the best parameter is found you can do the final evaluation on a separate test set (see the first sketch after this list).
  • Reduce the number of features: remove rare words (i.e. those which appear fewer than $N$ times) and/or use a feature selection method (see the second sketch after this list).
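
Here's a minimal sketch of the max_depth sweep, assuming X and y are your existing feature matrix and labels (the split size, repeat count and grid are only examples, and 'f1' assumes binary labels; use e.g. 'f1_macro' for multi-class):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                         train_test_split)

    # Hold out a test set that is used only for the final evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
    search = GridSearchCV(
        RandomForestClassifier(random_state=42, class_weight='balanced',
                               n_estimators=850, max_features='sqrt'),
        param_grid={'max_depth': list(range(3, 11))},
        scoring='f1', cv=cv)
    search.fit(X_train, y_train)

    print("best max_depth:", search.best_params_, "CV F1:", search.best_score_)
    # Final check on data the search never saw
    print("test F1:", f1_score(y_test, search.predict(X_test)))

If the cross-validation score drops only slightly while the gap to the training score shrinks, the shallower trees are a good trade.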
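
And a sketch of the feature-reduction idea, assuming the features come from a bag-of-words style vectorizer over raw texts (texts, y, and the min_df/k values are placeholders to adapt):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline

    # texts: list of raw documents, y: their labels (your own data)
    pipe = Pipeline([
        # min_df=5 drops words appearing in fewer than 5 documents (the rare-word filter)
        ('vect', TfidfVectorizer(min_df=5)),
        # keep only the k features most associated with the labels
        ('select', SelectKBest(chi2, k=1000)),
        ('clf', RandomForestClassifier(random_state=42, class_weight='balanced',
                                       n_estimators=850, max_depth=5,
                                       max_features='sqrt')),
    ])
    pipe.fit(texts, y)

Keeping the vectorizer and the selector inside the pipeline means they are re-fit on each training fold during cross-validation, so the feature selection cannot leak information from the validation folds.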