Cross-validating a CART model

https://stackoverflow.com/questions/16717970

30-05-2022

Question

In an assignment, we are asked to perform a cross-validation on a CART model. I have tried using the cvFit function from cvTools but got a strange error message. Here's a minimal example:

library(rpart)
library(cvTools)
data(iris)
cvFit(rpart(formula=Species~., data=iris))

The error I'm seeing is:

Error in nobs(y) : argument "y" is missing, with no default

And the traceback():

5: nobs(y)
4: cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K, 
       R = R, foldType = foldType, folds = folds, names = names, 
       predictArgs = predictArgs, costArgs = costArgs, envir = envir, 
       seed = seed)
3: cvFit(call, data = data, x = x, y = y, cost = cost, K = K, R = R, 
       foldType = foldType, folds = folds, names = names, predictArgs = predictArgs, 
       costArgs = costArgs, envir = envir, seed = seed)
2: cvFit.default(rpart(formula = Species ~ ., data = iris))
1: cvFit(rpart(formula = Species ~ ., data = iris))

It looks like y is mandatory for cvFit.default. But:

> cvFit(rpart(formula=Species~., data=iris), y=iris$Species)
Error in cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K,  : 
  'x' must have 0 observations

What am I doing wrong? Which package would allow me to do a cross-validation with a CART tree without having to code it myself? (I am sooo lazy...)

Solution

The caret package makes cross-validation a snap:

> library(caret)
> data(iris)
> tc <- trainControl("cv",10)
> rpart.grid <- expand.grid(.cp=0.2)
> 
> (train.rpart <- train(Species ~., data=iris, method="rpart",trControl=tc,tuneGrid=rpart.grid))
150 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validation (10 fold) 

Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 

Resampling results

  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.94      0.91   0.0798       0.12    

Tuning parameter 'cp' was held constant at a value of 0.2
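
The run above holds cp fixed at 0.2, so cross-validation only estimates the accuracy of that one tree. As a minimal sketch, the same call can compare several cp values and report the one with the best cross-validated accuracy. The candidate values and the seed below are arbitrary choices for illustration, not recommendations, and the column is named cp without the leading dot, which assumes a reasonably recent caret version.

```r
library(caret)
data(iris)

set.seed(42)  # arbitrary seed, only to make the fold assignment reproducible

# 10-fold cross-validation, as in the answer above
tc <- trainControl(method = "cv", number = 10)

# Hypothetical grid of complexity parameters to compare
cp.grid <- expand.grid(cp = c(0.01, 0.05, 0.1, 0.2))

fit <- train(Species ~ ., data = iris, method = "rpart",
             trControl = tc, tuneGrid = cp.grid)

fit$results   # one row per cp value with cross-validated Accuracy and Kappa
fit$bestTune  # the cp value with the highest cross-validated accuracy
```

train then refits the tree on the full data set with the selected cp and stores it in fit$finalModel.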

OTHER TIPS

Finally, I was able to get it to work. As Joran noted, the cost parameter needs to be adapted. In my case I am using 0/1 loss, so the cost function compares y and yHat with != instead of taking their difference. Also, predictArgs must include c(type='class'); otherwise the predict call used internally returns a matrix of class probabilities instead of the most probable class. To sum up:

library(rpart)
library(cvTools)
data(iris)
cvFit(rpart, formula=Species~., data=iris,
      cost=function(y, yHat) (y != yHat) + 0, predictArgs=c(type='class'))

(This uses another variant of cvFit. Additional args to rpart can be passed by setting the args= parameter.)
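
For comparison, here is a rough sketch of the same idea written by hand with only rpart and base R: 0/1 loss on class predictions, averaged over held-out folds. It is not what cvTools does internally. The fold count, the seed, and the names fold, train.set, test.set and fold.error are assumptions chosen for illustration.

```r
library(rpart)
data(iris)

set.seed(1)  # arbitrary seed, only so the fold assignment is reproducible
K <- 10      # number of folds (an assumption; any K >= 2 works)

# Randomly assign each row to one of K folds of (nearly) equal size
fold <- sample(rep(seq_len(K), length.out = nrow(iris)))

fold.error <- sapply(seq_len(K), function(k) {
  train.set <- iris[fold != k, ]
  test.set  <- iris[fold == k, ]
  fit <- rpart(Species ~ ., data = train.set)
  # type = "class" returns predicted labels rather than class probabilities
  yHat <- predict(fit, newdata = test.set, type = "class")
  # 0/1 loss averaged over the held-out fold
  mean(test.set$Species != yHat)
})

mean(fold.error)  # cross-validated misclassification rate
```

Leaving out type = "class" in the predict call would make yHat a matrix of class probabilities, so the comparison with test.set$Species would no longer measure misclassification.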

Licensed under: CC-BY-SA with attribution
