Question

I use MATLAB's sequentialfs function for forward feature selection; the code is below. When I run the same code several times, the selected features differ quite a lot between runs. The cross-validation partition is different for each run (the number of folds is the same), but I would still expect the selected features to be roughly the same. Could somebody help explain this? Thanks.

cp = cvpartition(label,'KFold',cvNum); % Stratified k-fold cross-validation

opts = statset('display','iter');
fun = @(XT,yT,Xt,yt)...
    (sum(yt ~= SVCpredict(Xt,yt,XT,yT)));

[fs,history] = sequentialfs(fun,data,label,'cv',cp,'options',opts);

Solution

If your data contains some variables that are highly predictive, and others not very predictive at all, then you would expect the set of variables selected by a feature subset selection method such as sequentialfs to be fairly stable, when run several times with a randomized cross-validation.

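But if the data contains variables that are all roughly equal in predictive power (especially if none are very predictive at all), then you'd expect the set of variables selected to vary more from run to run. Sequential selection is greedy: at each step it adds the variable that gives the lowest cross-validated error, so when several candidates are nearly tied, small fold-to-fold differences in the estimated error are enough to change which one wins, and every later step then builds on a different starting set.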

So if you're getting very different variables selected, just by changing the cross-validation folds, that would be evidence that your data does not contain any particular subset of variables that is much more predictive than the rest.
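One way to check this is to repeat the selection over several different partitions and count how often each variable is chosen. The following is only a rough sketch: it uses the Fisher iris data padded with noise columns as stand-in data, and a linear discriminant (classify) as a placeholder for SVCpredict, whose signature I don't know. The names nRuns, counts and selectionFreq are made up for illustration.

```matlab
% Stand-in data (an assumption): Fisher iris measurements plus 6 noise columns.
load fisheriris
rng(0);                                    % make the noise columns reproducible
data  = [meas, randn(size(meas,1), 6)];
label = species;
cvNum = 10;

% Criterion: number of misclassified test observations. classify is a
% placeholder here for the asker's own SVCpredict function.
fun = @(XT,yT,Xt,yt) sum(~strcmp(yt, classify(Xt, XT, yT)));

nRuns  = 20;                               % number of repeated selections
counts = zeros(1, size(data,2));           % times each column was selected
for r = 1:nRuns
    rng(r);                                % different but reproducible folds per run
    cp = cvpartition(label, 'KFold', cvNum);
    fs = sequentialfs(fun, data, label, 'cv', cp);
    counts = counts + fs;                  % fs is a logical row vector
end

selectionFreq = counts / nRuns;            % fraction of runs selecting each column
disp(selectionFreq)
```

Variables with a frequency near 1 are selected regardless of the partition; if most frequencies sit somewhere in the middle, no subset clearly stands out. Calling rng before cvpartition also makes a single run repeatable. As far as I recall, sequentialfs additionally has an 'mcreps' option that averages the criterion over several random partitions, but it only applies when 'cv' is given as a number of folds rather than a cvpartition object, so check the documentation before relying on it.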

You might conclude (although it's your data, so you would know better than I would, and it depends on the context) that feature subset selection is not the best way to proceed, and that some other form of dimensionality reduction might work better, for example PCA if your data is numerical.
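If you want to try PCA, a minimal sketch could look like the following. It again uses the iris measurements as stand-in data, and the 95% variance threshold is an arbitrary choice for illustration, not a recommendation.

```matlab
% Stand-in numerical data (an assumption): Fisher iris measurements.
load fisheriris
data = meas;

% Standardize the columns so that no variable dominates because of its scale.
Z = zscore(data);

% explained holds the percentage of total variance captured by each component.
[coeff, score, ~, ~, explained] = pca(Z);

% Keep the smallest number of components that reaches 95% of the variance
% (the threshold is arbitrary).
k = find(cumsum(explained) >= 95, 1);
reduced = score(:, 1:k);                   % observations in the reduced space

fprintf('Kept %d of %d components\n', k, size(data,2));
```

Note that if you go on to cross-validate a classifier on the reduced data, the standardization and PCA should really be fitted on the training folds only, otherwise the test folds leak into the transformation.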

Licensed under: CC-BY-SA with attribution