Question

I am using Scikit-Learn for this classification problem. The dataset has 3 features and 600 data points with labels.

First I used a Nearest Neighbors classifier. Instead of using cross-validation, I manually ran the fit 5 times, each time re-splitting the dataset (80-20) into a training set and a test set. The average score turns out to be 0.61:

clf = KNeighborsClassifier(4)
score = 0
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf.fit(X_train, y_train)
    score += clf.score(X_test, y_test)
print(score / 5.0)

However, when I ran cross-validation, the average score was merely 0.45:

clf = KNeighborsClassifier(4)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()

Why does cross-validation produce a significantly lower score than manual resampling?
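For reference, the manual loop above shuffles the data before every split, whereas `cross_val_score` on a classifier defaults to stratified folds *without* shuffling, which matters if the rows are ordered. The difference can be checked with a shuffled `StratifiedKFold`; this is a sketch on synthetic stand-in data, since the original `X`, `y` are not shown:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the 600-point, 3-feature dataset described above.
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

clf = KNeighborsClassifier(n_neighbors=4)

# Default cv=5 on a classifier uses StratifiedKFold WITHOUT shuffling:
# if the rows are grouped (by class, by time, ...), folds are not random samples.
unshuffled = cross_val_score(clf, X, y, cv=5)

# Shuffling first makes the folds comparable to repeated train_test_split.
shuffled = cross_val_score(
    clf, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print(unshuffled.mean(), shuffled.mean())
```

On data whose rows are already in random order the two means should be close; a large gap suggests the original dataset is ordered on disk.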

I also tried a Random Forest classifier, this time using Grid Search to tune the parameters:

param_grid = {
    'bootstrap': [True],
    'max_depth': [8, 10, 12],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [62, 64, 66, 68, 70]
}
clf = RandomForestClassifier()
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X, y)
grid_search.best_params_, grid_search.best_score_

The best score turned out to be 0.508, with the following parameters:

({'bootstrap': True,
  'max_depth': 10,
  'min_samples_leaf': 4,
  'min_samples_split': 10,
  'n_estimators': 64},
 0.5081967213114754)
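The spread behind that mean can be inspected via `cv_results_`: if the 5 fold scores disagree strongly, the mean is noisy. A minimal sketch, again on hypothetical synthetic data and a reduced grid:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in dataset (600 points, 3 features).
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {'max_depth': [8, 10]}, cv=5)
grid.fit(X, y)

# mean_test_score / std_test_score have one entry per parameter combination;
# a large std means the folds disagree and best_score_ carries high variance.
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
print(means, stds)
```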

I went ahead and made predictions on all 600 data points, and the accuracy is quite high: 0.7688.

best_grid = grid_search.best_estimator_
y_pred = best_grid.predict(X)
accuracy_score(y, y_pred)

I know `.best_score_` is the "Mean cross-validated score of the best_estimator". But I don't understand why it is so much lower than the prediction accuracy on the whole set.
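For comparison: with the default `refit=True`, `GridSearchCV` refits the best parameter combination on every point it was given, so `best_estimator_.predict(X)` is scoring on training data. A sketch of an evaluation on a held-out set the search never sees, using synthetic stand-in data and a reduced grid:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in dataset (600 points, 3 features).
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Keep a test set the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

grid_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           {'max_depth': [8, 10], 'n_estimators': [64]},
                           cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

best = grid_search.best_estimator_
# Optimistic: best_estimator_ was refit on all of X_train.
train_acc = best.score(X_train, y_train)
# Honest estimate: data the search and the refit never touched.
test_acc = best.score(X_test, y_test)
print(grid_search.best_score_, train_acc, test_acc)
```

Typically `train_acc` comes out well above both `best_score_` and `test_acc`, which mirrors the 0.7688 vs 0.508 gap above.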

No correct solution

Licensed under: CC-BY-SA with attribution