This probably means that there is a significant discrepancy between the distribution of the final evaluation data and that of the development set.
It would be interesting to measure how much your decision tree overfits, though: what is the difference between the training score `clf.score(X_train, y_train)` and the test score `clf.score(X_test, y_test)` on your split?
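A minimal sketch of that check, assuming a `DecisionTreeClassifier` and a synthetic dataset standing in for yours:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data (an assumption; use your own X, y).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_score = clf.score(X_train, y_train)  # typically ~1.0 for an unpruned tree
test_score = clf.score(X_test, y_test)
gap = train_score - test_score  # a large gap indicates overfitting
print(f"train={train_score:.3f} test={test_score:.3f} gap={gap:.3f}")
```

A large gap (e.g. training accuracy near 1.0 while test accuracy is much lower) confirms the tree has memorized the training set.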
Also, a single decision tree should be considered a toy classifier: it has very poor generalization properties and can overfit badly. You should really try `ExtraTreesClassifier` with an increasing number of trees via `n_estimators`. Start with `n_estimators=10`, then try 50, 100, 500, and 1000 if the dataset is small enough.
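The sweep above can be sketched as follows, again on placeholder data (the dataset and the capped `n_estimators` grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (an assumption; use your own X, y).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for n in [10, 50, 100, 500]:  # extend to 1000 if the dataset is small enough
    clf = ExtraTreesClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    scores[n] = clf.score(X_test, y_test)
    print(f"n_estimators={n}: test accuracy {scores[n]:.3f}")
```

The test score usually improves and then plateaus as `n_estimators` grows; adding trees beyond the plateau only costs compute, it does not cause overfitting.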