scikit-learn: SVM giving me zero error, but can't predict

https://stackoverflow.com/questions/21393704

03-10-2022
Question

I am working on a support vector machine, using scikit-learn in Python.

I have trained the model, used GridSearch and cross-validation to find the optimal parameters, and have evaluated the best model on a 15% holdout set.

The confusion matrix at the end shows 0 misclassifications.
However, the model later gives incorrect predictions when I give it a handwritten digit (I haven't included the code for this, to keep this question reasonably short).

Because the SVM has zero error on the holdout set yet fails to predict new inputs correctly, I suspect I have built it incorrectly.

My question is this:

Am I right to suspect that I somehow used cross-validation together with GridSearch incorrectly? Or have I given GridSearch unreasonable parameters that are producing misleading results?

Thanks for your time and effort for reading this far.

STEP 1: split the data set into 85%/15% using the train_test_split function

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.15, random_state=0)

STEP 2: apply the GridSearchCV function to the training set to tune the classifier

C_range = 10.0 ** np.arange(-2, 9)
gamma_range = 10.0 ** np.arange(-5, 4)
param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedKFold(y=y, n_folds=3)

grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)
grid.fit(X, y)

print("The best classifier is: ", grid.best_estimator_)

The output is here:

('The best classifier is: ', SVC(C=10.0, cache_size=200,
class_weight=None, coef0=0.0, degree=3,
 gamma=0.0001, kernel='rbf', max_iter=-1, probability=False,
 random_state=None, shrinking=True, tol=0.001, verbose=False))

STEP 3: Finally, evaluate the tuned classifier on the remaining 15% hold-out set.

clf = svm.SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

clf.fit(X_train, y_train)

clf.score(X_test, y_test)
y_pred = clf.predict(X_test)

The output is here:

            precision    recall  f1-score   support

      -1.0       1.00      1.00      1.00         6
       1.0       1.00      1.00      1.00        30

avg / total       1.00      1.00      1.00        36

Confusion Matrix:
[[ 6  0]
 [ 0 30]]

Solution

Your test set contains too little data (only 6 samples for one of the classes) to be confident in the predictive accuracy of your model. I would recommend labeling at least 150 samples per class and keeping 50 samples in the held-out test set to compute the evaluation metrics.
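
To illustrate why a perfect score on so few samples says little, here is a minimal sketch that computes a Wilson score interval for the fraction of correctly classified samples. The helper name `wilson_interval` and the choice of a 95% interval are assumptions made for illustration; only the counts (6, 30, 36) come from the report in the question.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts taken from the classification report in the question
for label, n in [("class -1.0", 6), ("class 1.0", 30), ("all samples", 36)]:
    low, high = wilson_interval(n, n)
    print(f"{label}: {n}/{n} correct, interval [{low:.2f}, {high:.2f}]")
```

For the class with 6 test samples, the lower end of the interval is roughly 0.61, so the observed result is compatible with a true recall far below 100% for that class.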

Edit: also have a look at the new sample that it fails to predict. Are its feature values in the same range as the digits from the training and test sets (e.g. [0, 255] instead of [0, 1] or [-1, 1])? Does the new digit "look" like the other digits from your test set when you plot them, using matplotlib for instance?
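
The range check can be sketched as follows. The arrays and the helper names `feature_range` and `rescale_to_unit` are hypothetical, and the sketch assumes the training features were scaled to [0, 1] while the new digit arrives as raw 8-bit pixel values; the actual preprocessing in the question is not shown.

```python
import numpy as np


def feature_range(X):
    """Return (min, max) over all feature values in X."""
    X = np.asarray(X, dtype=float)
    return float(X.min()), float(X.max())


def rescale_to_unit(X, max_value=255.0):
    """Map raw intensities in [0, max_value] to [0, 1]."""
    return np.asarray(X, dtype=float) / max_value


# Hypothetical data: training features in [0, 1], new digit as raw pixels
X_train = np.array([[0.0, 0.5, 1.0], [0.2, 0.8, 0.0]])
new_digit = np.array([[0, 128, 255]])

train_range = feature_range(X_train)
new_range = feature_range(new_digit)
print("training range:  ", train_range)
print("new sample range:", new_range)

# If the new sample exceeds the training range, bring it onto the same scale
if new_range[1] > train_range[1]:
    new_digit_scaled = rescale_to_unit(new_digit)
    print("rescaled range:  ", feature_range(new_digit_scaled))
```

If the two ranges differ, the new sample should go through the same preprocessing as the training data before being passed to `clf.predict`.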

Licensed under: CC-BY-SA with attribution
