Question

I have a relatively small dataset of 30 samples with binary labels (16 positive and 14 negative), and five continuous features for each of these samples. I'm trying to use a support-vector classifier (SVC) for this task, and I tested the performance of different feature combinations and regularization strengths using leave-one-out cross-validation.

One odd thing I found is that if I take feature A and use it alone for classification, I might get, say, 87% classification accuracy. If I use feature B in isolation, I might get 60% accuracy (i.e., the same as the majority-classifier baseline). But combining all the features, I get only 63% accuracy. This is despite searching across a large range of regularization strengths.

In case it matters, I'm using the sklearn SVC implementation, and varying the regularization parameter C.
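
For reference, a minimal sketch of this kind of setup (with a random 30 x 5 matrix X and a label vector y standing in for the real data) looks something like this:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))       # stand-in for the real 30 x 5 feature matrix
    y = np.array([1] * 16 + [0] * 14)  # 16 positive, 14 negative labels

    # Leave-one-out accuracy for a few regularization strengths
    for C in [0.01, 0.1, 1, 10, 100]:
        scores = cross_val_score(SVC(C=C), X, y, cv=LeaveOneOut(), scoring="accuracy")
        print(f"C={C:g}: LOOCV accuracy = {scores.mean():.2f}")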

Is this sort of behavior typical with an SVC classifier? I'm not too familiar with support-vector algorithms in general.


Solution

30 samples probably just isn't enough. The more features you want to use, the more samples you'll need - otherwise you'll get huge model instability and overfitting (especially with bad/useless features). With only 30 samples you'll probably get the "best" results with 1 or 2 carefully selected features. Get 100 or 200 samples, then try again with 5 features.
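
As a rough sketch of that advice (reusing the placeholder X and y from the snippet above, with C fixed at 1.0 purely for illustration), you can compare each single feature against all five under the same leave-one-out scheme:

    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    # Score each feature on its own, then all five together
    for cols in ([0], [1], [2], [3], [4], [0, 1, 2, 3, 4]):
        scores = cross_val_score(SVC(C=1.0), X[:, cols], y,
                                 cv=LeaveOneOut(), scoring="accuracy")
        print(f"features {cols}: LOOCV accuracy = {scores.mean():.2f}")

On a dataset this small, it's quite normal for one of the single-feature scores to beat the full five-feature model in this comparison.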

Also make sure you're standardizing your features - for example, by removing the mean and scaling each feature to unit variance. SVMs are sensitive to feature scale, so features with much larger values than the others will dominate the model.
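
A minimal sketch of that, again assuming the placeholder X and y from above, is to wrap a StandardScaler and the SVC in a Pipeline so the scaling is re-fit on each leave-one-out training split:

    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # StandardScaler removes the mean and scales to unit variance;
    # keeping it inside the pipeline means the held-out sample's statistics
    # never leak into the scaling fit
    model = make_pipeline(StandardScaler(), SVC(C=1.0))
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
    print(f"LOOCV accuracy with standardized features: {scores.mean():.2f}")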

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange