Question

I'm using scikit-learn 0.13.1 for a contest on Kaggle. I'm using a Decision Tree classifier, and to evaluate my estimator I either split the training data with train_test_split or run cross-validation with cross_val_score. Either technique shows that the estimator is about 90% accurate. However, when I use the estimator on the actual test data, the accuracy I obtain is about 30% lower. Let's assume that the training data is a good representation of the test data.

What else can I do to evaluate the accuracy of the estimator?

# imports for scikit-learn 0.13.1 (train_test_split and cross_val_score live in sklearn.cross_validation)
from sklearn import tree
from sklearn import cross_validation as cv
from sklearn.cross_validation import train_test_split

clf = tree.DecisionTreeClassifier()
...
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=42)
...
clf.fit(X_train, y_train)
print "Accuracy: %0.2f" % clf.score(X_test, y_test)
...
scores = cv.cross_val_score(clf, train, target, cv=15)
print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() / 2)

Solution

This probably means that there is a significant discrepancy between the distribution of the final evaluation data and the development set.
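One rough way to check for such a discrepancy (not something prescribed by this answer, just a sanity check) is to compare per-feature statistics of the development data against the final evaluation data. Here train is the development feature matrix from the question, and test is a hypothetical name for the Kaggle evaluation features loaded the same way:

import numpy as np

# `train` is the development feature matrix from the question; `test` is a
# hypothetical name for the final evaluation features loaded the same way
train_arr = np.asarray(train, dtype=float)
test_arr = np.asarray(test, dtype=float)

# large differences in these summary statistics suggest the two sets
# are not drawn from the same distribution
for i in range(train_arr.shape[1]):
    print "feature %d: train mean=%0.3f, eval mean=%0.3f" % (
        i, train_arr[:, i].mean(), test_arr[:, i].mean())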

It would be interesting to measure the over-fitting of your decision trees though: what is the difference between the training score clf.score(X_train, y_train) and the testing score clf.score(X_test, y_test) on your split?
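A minimal sketch of that check, reusing the split from the question (train and target are assumed to be the already-loaded feature matrix and labels):

from sklearn import tree
from sklearn.cross_validation import train_test_split

# same split as in the question
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.3, random_state=42)

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

# a large gap between these two numbers (e.g. 1.00 vs 0.90) means the tree
# has memorized the training split rather than learned something general
print "Train accuracy: %0.2f" % clf.score(X_train, y_train)
print "Test accuracy:  %0.2f" % clf.score(X_test, y_test)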

Also, pure decision trees should be considered toy classifiers: they have very poor generalization properties and can overfit a lot. You should really try ExtraTreesClassifier with increasing values of n_estimators. Start with n_estimators=10, then 50, 100, 500, and 1000 if the dataset is small enough.
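A rough sketch of that sweep using cross-validation (again assuming train and target are the data from the question; the 5 folds and random_state=42 are arbitrary choices):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn import cross_validation as cv

# cross-validate ExtraTreesClassifier for increasing forest sizes
for n in (10, 50, 100, 500, 1000):
    clf = ExtraTreesClassifier(n_estimators=n, random_state=42)
    scores = cv.cross_val_score(clf, train, target, cv=5)
    print "n_estimators=%4d: %0.2f (+/- %0.2f)" % (n, scores.mean(), scores.std())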

Licensed under: CC-BY-SA with attribution