Question

I have a multiclass classification problem. My dataset (let's call the data X and the labels y) represents sets of points on 640x480 images, so all elements of X are integers within the range of valid pixel coordinates. I'm trying to use an SVM for this problem. If I run the SVM against the dataset as is, it gives an accuracy of 74%. However, if I scale the data to the range [0..1], it gives much poorer results: only 69% of predictions are correct.

I double-checked the histogram of the elements in X and in its scaled version Xs, and they are identical. So the data is not corrupted, just normalized. Knowing the ideas behind SVMs, I assumed scaling should not affect the results, but it does. Why does this happen?


Here's my code in case I made a mistake in it:

>>> import numpy as np
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.svm import SVC
>>> 
>>> X, y = ...
>>> Xs = X.astype(np.float32) / (X.max() - X.min())
>>> cross_val_score(SVC(kernel='linear'), X, y, cv=10).mean()
0.74531073446327667
>>> cross_val_score(SVC(kernel='linear'), Xs, y, cv=10).mean()
0.69485875706214695

Solution

Scaling should certainly affect the results, but it should improve them. The performance of an SVM is critically dependent on its C setting, which trades off the cost of misclassifications on the training set against model simplicity. Rescaling the features changes their magnitudes, so the same C value effectively corresponds to a different amount of regularization, which is why the two runs are not directly comparable. C should be determined using e.g. grid search and nested cross-validation; the default settings are very rarely optimal for any given problem.
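
As a rough illustration of what this means in practice, here is a minimal sketch of tuning C with a grid search inside an outer cross-validation loop. It uses the modern `sklearn.model_selection` API rather than the deprecated `sklearn.cross_validation` module, made-up random data in place of the question's X and y, and an assumed candidate grid for C:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data with the same flavor as in the question:
# integer pixel coordinates as features, integer class labels.
rng = np.random.RandomState(0)
X = rng.randint(0, 640, size=(300, 4))
y = rng.randint(0, 3, size=300)

# Scale inside a pipeline so each CV fold is scaled using only its training data.
pipeline = make_pipeline(MinMaxScaler(), SVC(kernel='linear'))

# Candidate values for C; the parameter name is prefixed with the pipeline step name.
param_grid = {'svc__C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipeline, param_grid, cv=5)

# Nested cross-validation: the inner loop (inside GridSearchCV) picks C,
# the outer loop estimates the accuracy of the whole tuning procedure.
scores = cross_val_score(search, X, y, cv=10)
print(scores.mean())
```

Putting the scaler inside the pipeline also ensures it is fit on each training fold only, so no information from the held-out fold leaks into the scaling step.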

Licensed under: CC-BY-SA with attribution