There are some confusing things here. I think what you are describing is more of a standard train/test split; the term "cross-validation" is usually used differently. So you've held out 30% of the data for testing, which is good, and you can use that to find out how optimistic your training-set estimate of AUC is. But of course the estimate depends on how you do the train/test split, and it would be good to know how much this test performance varies. You can use multiple runs of cross-validation to find that out.
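To make that concrete, here is a minimal sketch of the holdout comparison, assuming a data frame dat with a binary column response and a logistic regression model (the names are placeholders, not taken from your code):

library(ROCR)

set.seed(1)
idx <- sample(nrow(dat), floor(0.7 * nrow(dat)))  # 70/30 train/test split
fit <- glm(response ~ ., data = dat[idx, ], family = binomial)
auc_of <- function(pred, act)
  performance(prediction(pred, act), "auc")@y.values[[1]]
auc_of(predict(fit, type = "response"), dat$response[idx])                           # training AUC (optimistic)
auc_of(predict(fit, newdata = dat[-idx, ], type = "response"), dat$response[-idx])   # test AUC

Rerun this with a different seed and you will typically see the test AUC move around; that variability is exactly what cross-validation helps you quantify.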
Cross-validation is slightly different from just using a holdout set. Five-fold cross-validation, for example, involves the following steps:
- Randomly split the full dataset into five equal-sized parts.
- For i = 1 to 5, fit the model on all the data except the i-th part.
- Evaluate AUC on the part that was held out from the fit.
- Average the five AUC results.
This process can be repeated multiple times to estimate the mean and variance of the out-of-sample estimate.
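For illustration, a hand-rolled sketch of one round of 5-fold cross-validation, again assuming a data frame train with a binary column response and a logistic regression model:

library(ROCR)

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))  # random fold assignment

aucs <- sapply(1:k, function(i) {
  # fit on everything except fold i, then score the held-out fold
  fit  <- glm(response ~ ., data = train[folds != i, ], family = binomial)
  pred <- predict(fit, newdata = train[folds == i, ], type = "response")
  performance(prediction(pred, train$response[folds == i]), "auc")@y.values[[1]]
})
mean(aucs)  # the cross-validated AUC estimate

Wrapping this in an outer loop over several random fold assignments gives you the mean and variance mentioned above.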
The R package cvTools allows you to do this. For example:
library(ROCR)
library(cvTools)

# cvFit passes the observed response to the cost function as the
# first argument and the predictions as the second
calc_AUC <- function(act, pred) {
  u <- prediction(pred, act)
  performance(u, "auc")@y.values[[1]]
}

cvFit(m, data = train, y = train$response,
      cost = calc_AUC, predictArgs = list(type = "response"))
will perform 5-fold cross-validation of the model m using AUC as the performance metric. cvFit also takes the arguments K (the number of cross-validation folds) and R (the number of times to perform the cross-validation with different random splits).
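As a rough usage sketch (m and train as above), repeating 5-fold cross-validation ten times would look something like this; if I remember the cvTools return object correctly, $cv holds the averaged results, $se their standard error, and $reps the per-replication results when R > 1:

cv_res <- cvFit(m, data = train, y = train$response,
                cost = calc_AUC, K = 5, R = 10,
                predictArgs = list(type = "response"))
cv_res$cv    # average AUC over the 10 repetitions
cv_res$se    # standard error of that average
cv_res$reps  # AUC from each individual repetition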
See http://en.wikipedia.org/wiki/Cross-validation_(statistics) for more info on cross-validation.