SVM: scaled dataset gives worse results?
-
18-06-2023 - |
Question
I have a multiclass classification problem. My dataset (call the data X and the labels y) represents sets of points on 640x480 images, so all elements of X are integers in the valid pixel range. I'm trying to use an SVM for this problem. If I run the SVM against the dataset as is, it gives an accuracy of 74%. However, if I scale the data to the range [0..1], it gives much poorer results: only 69% correct.
I double-checked the histograms of the elements in X and its scaled version Xs, and they are identical. So the data is not corrupted, just normalized. Knowing the ideas behind SVMs, I assumed scaling should not affect the results, but it does. Why does this happen?
Here's my code in case I made a mistake in it:
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.svm import SVC
>>>
>>> X, y = ...
>>> Xs = X.astype(np.float32) / (X.max() - X.min())
>>> cross_val_score(SVC(kernel='linear'), X, y, cv=10).mean()
0.74531073446327667
>>> cross_val_score(SVC(kernel='linear'), Xs, y, cv=10).mean()
0.69485875706214695
Solution
Scaling should certainly affect results, but it should improve them. However, the performance of an SVM is critically dependent on its C setting, which trades off the cost of misclassification on the training set against model simplicity, and which should be determined using e.g. grid search and nested cross-validation. The default settings are very rarely optimal for any given problem.
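The tuning the answer recommends could be sketched like this (the dataset, the C grid, and the fold counts here are illustrative assumptions, not from the original post; modern scikit-learn moved cross_val_score to sklearn.model_selection):

```python
# A minimal sketch: scale the data and tune C with a grid search,
# scored by nested cross-validation. Dataset and grid are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # stand-in multiclass image data

# Scale inside a pipeline so the scaler is fit only on each training fold,
# avoiding leakage from the test fold.
pipe = make_pipeline(MinMaxScaler(), SVC(kernel='linear'))

# Inner loop: tune C over a coarse logarithmic grid (illustrative values).
grid = GridSearchCV(pipe, {'svc__C': [0.01, 0.1, 1, 10, 100]}, cv=5)

# Outer loop: score the tuned model on folds it never saw during tuning.
scores = cross_val_score(grid, X, y, cv=5)
print(scores.mean())
```

With this setup the comparison between raw and scaled data is made at each dataset's best C rather than at the default, which is the fair comparison the answer is pointing at.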
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow