Question

I am using the caret package in R to train a radial basis SVM for classification; in addition, a linear SVM is used for variable selection. With metric="Accuracy" this works fine, but ultimately I am more interested in optimizing metric="ROC". While the ROC is calculated for all models that are fit, there seems to be some problem with aggregating the ROC values.

The following is some example code:

library(caret)
library(mlbench)
library(kernlab)   # ksvm() and vanilladot() are used in the ranking function below

set.seed(0)

data(Sonar)
x<-scale(Sonar[,1:60])
y<-as.factor(Sonar[,61])

# Custom summary function that combines defaultSummary()
# and twoClassSummary(); the input and output of the
# summary function are also printed for inspection

svm.summary<-function(data, lev = NULL, model = NULL){
 print(head(data,n=3))
 a<-defaultSummary(data, lev, model)
 b<-twoClassSummary(data, lev, model)
 out<-c(a,b)
 print(out)
 out}

fitControl <- trainControl(
 method = "cv",
 number = 2,
 classProbs = TRUE,
 summaryFunction=svm.summary,
 verboseIter = TRUE,
 allowParallel = FALSE)

# Ranking function: Rank Variables using a linear 
# SVM 

rankSVM<-function(object,x,y) {
 print("ranking")
 obj<-ksvm(x=as.matrix(x), y=y, 
  kernel=vanilladot,
  kpar=list(), C=10,
  scaled=F)
 w<-t(obj@coef[[1]]%*%obj@xmatrix[[1]])
 z<-abs(w)/sqrt(sum(w^2))
 ord<-order(z,decreasing=T)
 data.frame(var=dimnames(z)[[1]][ord],Overall=z[ord])
}


svmFuncs<-getModelInfo("svmRadial",regex=F)

svmFit<-function(x,y,first,last,...) {
 out<-train(x=x,y=as.factor(y),    
  method="svmRadial",
  trControl=fitControl,
  scaled=F,
  metric="Accuracy",
  maximize=T,
  returnData=T)
  out$finalModel}

selectionFunctions<-list(summary=svm.summary,
 fit=svmFit,
 pred=svmFuncs$svmRadial$predict,
 prob=svmFuncs$svmRadial$prob,
 rank=rankSVM,
 selectSize=pickSizeBest,
 selectVar=pickVars)                         

selectionControl<-rfeControl(functions=selectionFunctions,
 rerank=F,
 verbose=T,
 method="cv",
 number=2)

subsets<-c(1,30,60)

svmProfile<-rfe(x=x,y=y,
 sizes=subsets,
 metric="Accuracy",
 maximize=TRUE,
 rfeControl=selectionControl)

svmProfile

The final output is the following:

> svmProfile

Recursive feature selection

Outer resampling method: Cross-Validated (2 fold) 

Resampling performance over subset size:

Variables Accuracy  Kappa ROC   Sens   Spec AccuracySD KappaSD ROCSD  SensSD SpecSD Selected
        1   0.8075 0.6122 NaN 0.8292 0.7825    0.02981 0.06505    NA 0.06153 0.1344        *
       30   0.8028 0.6033 NaN 0.8205 0.7825    0.00948 0.02533    NA 0.09964 0.1344         
       60   0.8028 0.6032 NaN 0.8206 0.7823    0.00948 0.02679    NA 0.12512 0.1635         

The top 1 variables (out of 1):
V49

ROC is NaN. Inspecting the output (verbose output is switched on, and the summary function was patched to print parts of its input as well as its output) reveals that, while the ROC seems to be calculated correctly when tuning the SVMs in the inner loop:

+ Fold1: sigma=0.01172, C=0.25 
  pred obs         M         R
1    M   R 0.6658878 0.3341122
2    M   R 0.5679477 0.4320523
3    R   R 0.2263576 0.7736424
 Accuracy     Kappa       ROC      Sens      Spec 
0.6730769 0.3480826 0.7961310 0.6428571 0.7083333 
- Fold1: sigma=0.01172, C=0.25 
+ Fold1: sigma=0.01172, C=0.50 
  pred obs         M         R
1    M   R 0.7841249 0.2158751
2    M   R 0.7231365 0.2768635
3    R   R 0.3033492 0.6966508
 Accuracy     Kappa       ROC      Sens      Spec 
0.7692308 0.5214724 0.8407738 0.9642857 0.5416667 
- Fold1: sigma=0.01172, C=0.50 

[...]

there seems to be a problem in the outer iteration. "Between" two folds we get the following:

-(rfe) fit Fold1 size:  1 
  pred obs Variables
1    M   R         1
2    M   R         1
3    M   R         1
 Accuracy     Kappa       ROC      Sens      Spec 
0.7864078 0.5662328        NA 0.8727273 0.6875000 
  pred obs Variables
1    R   R        30
2    M   R        30
3    M   R        30
 Accuracy     Kappa       ROC      Sens      Spec 
0.7961165 0.5853939        NA 0.8909091 0.6875000 
  pred obs Variables
1    R   R        60
2    M   R        60
3    M   R        60
 Accuracy     Kappa       ROC      Sens      Spec 
0.7961165 0.5842783        NA 0.9090909 0.6666667 
+(rfe) fit Fold2 size: 60 

So here it seems that the input to the summary function is a data frame that contains the number of variables instead of the class probabilities, and so the ROC values cannot be calculated or aggregated correctly. Does anybody know how to prevent this? Did I forget to tell caret to output class probabilities somewhere?
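For reference, twoClassSummary() can only compute an ROC when the data it receives contains a probability column named after each class level (here M and R) next to obs and pred; without those columns the ROC cannot be computed, which is consistent with the NA values above. A minimal, self-contained illustration with made-up numbers (not taken from the model above):

library(caret)

# twoClassSummary() needs columns "obs", "pred", and one probability
# column per class level; the probabilities below are invented.
# (caret uses the pROC package for the ROC calculation.)
toy <- data.frame(obs  = factor(c("M", "R", "R"), levels = c("M", "R")),
                  pred = factor(c("M", "M", "R"), levels = c("M", "R")),
                  M    = c(0.7, 0.6, 0.2),
                  R    = c(0.3, 0.4, 0.8))
twoClassSummary(toy, lev = c("M", "R"))   # returns ROC, Sens and Spec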

Help is greatly appreciated, as caret is really a cool package to use and would save me plenty of work if I can get this to run correctly.

Thoralf


Solution

getModelInfo is designed to get code for train and doesn't automatically work with rfe (I'll make a note of that in the documentation). rfe doesn't look for a slot called probs, and no probability predictions means no ROC summary.

You might want to base your code on caretFuncs, which is designed to work with rfe and should automate a lot of what I think you would like to do.

For example, in caretFuncs, the pred module will create class and probability predictions:

function(object, x) {
  tmp <- predict(object, x)
  if(object$modelType == "Classification" &
     !is.null(object$modelInfo$prob)) {
    out <- cbind(data.frame(pred = tmp),
                 as.data.frame(predict(object, x, type = "prob")))
  } else out <- tmp
  out
}

You might be able to simply plug your rankSVM into caretFuncs$rank.
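For example, a minimal sketch of that approach (untested here; the names newFuncs and newControl are only for illustration), reusing the objects defined in the question (svm.summary, rankSVM, fitControl, x, y and subsets), could look like this:

# Start from caretFuncs, whose pred module returns class probabilities,
# and swap in the custom summary and ranking functions from the question.
newFuncs <- caretFuncs
newFuncs$summary <- svm.summary
newFuncs$rank    <- rankSVM

newControl <- rfeControl(functions = newFuncs,
                         method = "cv",
                         number = 2,
                         verbose = TRUE)

svmProfile <- rfe(x = x, y = y,
                  sizes = subsets,
                  metric = "ROC",            # outer selection now uses ROC
                  rfeControl = newControl,
                  # the arguments below are passed on to train() by caretFuncs$fit;
                  # note that the inner train() still tunes by its default metric
                  method = "svmRadial",
                  trControl = fitControl)

svmProfile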

Take a look at the feature selection page on the website. It has details about what code modules you will need.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow