Question

I'm stuck on the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I do feature selection (a filter method with chi.squared) and then build an SVM model with the training set and the selected features.
So in the end, I obtain 10 different models (because of the feature selection). But now I want to make a ROC curve in R for this filter method in general. How can I do this?
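
For illustration, the per-fold step described above might look roughly like this. This is only a sketch, not the original code: it assumes the FSelector and e1071 packages, and train and Class are placeholder names.

library(FSelector)
library(e1071)

# Rank features on the training folds with the chi-squared filter
weights  <- chi.squared(Class ~ ., data = train)
features <- cutoff.k(weights, 10)   # keep e.g. the 10 best-ranked features

# Fit the SVM on the selected features only; probability = TRUE returns
# class probabilities rather than hard class labels
fit <- svm(as.simple.formula(features, "Class"), data = train, probability = TRUE)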

Silke

Solution

You can indeed store the predictions, provided they are all on the same scale (be especially careful about this since you perform feature selection: some methods produce scores that depend on the number of features), and use them to build a ROC curve. Here is the code I used for a recent paper:

library(pROC)
data(aSAH)
k <- 10
n <- dim(aSAH)[1]
# Randomly assign each observation to one of the k folds
indices <- sample(rep(1:k, ceiling(n/k))[1:n])

all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test  <- aSAH[indices == i, ]
  learn <- aSAH[indices != i, ]
  # Logistic regression fitted on the 9 training folds
  model <- glm(as.numeric(outcome) - 1 ~ s100b + ndka + as.numeric(wfns),
               data = learn, family = binomial(link = "logit"))
  # Predictions on the held-out fold
  model.pred <- predict(model, newdata = test)
  # Per-fold AUC, plus the pooled responses and predictions
  aucs <- c(aucs, roc(test$outcome, model.pred)$auc)
  all.response <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, model.pred)
}

roc(all.response, all.predictor)
mean(aucs)

The ROC curve is built from all.response and all.predictor, which are updated at each step. The code also stores the per-fold AUC in aucs for comparison. Both results should be quite similar when the sample size is sufficiently large. With small samples inside the cross-validation, however, the per-fold AUCs may be underestimated: the ROC curve built from all the pooled data tends to be smoother and is less underestimated by the trapezoidal rule.
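
The same pooling pattern can be adapted to the chi-squared filter plus SVM workflow from the question. The following is a sketch only (it assumes the FSelector and e1071 packages; mydata and its factor column Class are placeholders): the filter is re-run inside each fold, and the pooled class probabilities feed a single ROC curve for the pipeline as a whole.

library(pROC)
library(FSelector)
library(e1071)

k <- 10
n <- nrow(mydata)
indices <- sample(rep(1:k, ceiling(n/k))[1:n])

all.response <- all.predictor <- c()
for (i in 1:k) {
  test  <- mydata[indices == i, ]
  learn <- mydata[indices != i, ]

  # Feature selection on the training folds only, as in the question
  weights  <- chi.squared(Class ~ ., data = learn)
  features <- cutoff.k(weights, 10)

  # SVM on the selected features; class probabilities are on the same 0-1 scale in every fold
  fit  <- svm(as.simple.formula(features, "Class"), data = learn, probability = TRUE)
  pred <- predict(fit, newdata = test, probability = TRUE)
  # Probability of the second factor level, taken here as the positive class (assumption)
  prob <- attr(pred, "probabilities")[, levels(mydata$Class)[2]]

  all.response  <- c(all.response, as.character(test$Class))
  all.predictor <- c(all.predictor, prob)
}

roc(all.response, all.predictor)   # one ROC curve for the filter + SVM pipeline

Calling plot() on the returned roc object draws the curve, and ci.auc() gives a confidence interval for the pooled AUC.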

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow