Question

I have training data, which I randomly split into two parts:

  • 70% -> train_train
  • 30% -> train_cv (for cross-validation)

I fit a glm (glmnet) model using train_train, then cross-validate with train_cv.

My problem is that a different random split for train_train and train_cv returns different cross-validation results (evaluated using Area Under the Curve, "AUC"):

AUC = 0.6381583 the 1st time

AUC = 0.6164524 the 2nd time

Is there a way to run multiple cross-validations, without duplicating the code?


Solution

There are a few things to untangle here. What you are describing is a standard train/test (holdout) split; the term cross-validation is usually used differently. Holding out 30% of the data for testing is good, and it lets you see how optimistic the AUC estimated on the training set is. But the held-out estimate depends on which observations end up in each part, so it is worth knowing how much the test performance varies from split to split. Multiple runs of cross-validation let you measure that.
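As a rough illustration of that split-to-split variation, the sketch below repeats the 70/30 split many times with replicate() instead of copying the code. It rests on several assumptions: the data frame is called train and has a 0/1 column called response, plain glm stands in for glmnet so the example runs in base R, the data are simulated, and auc and holdout_auc are hypothetical helper names.

```r
# Rank-based AUC (Mann-Whitney form); labels are assumed to be coded 0/1.
auc <- function(actual, score) {
  r <- rank(score)  # average ranks, so ties are handled
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One random split: fit on train_frac of the rows, score AUC on the rest.
holdout_auc <- function(data, train_frac = 0.7) {
  idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  fit <- glm(response ~ ., data = data[idx, ], family = binomial)
  pred <- predict(fit, newdata = data[-idx, ], type = "response")
  auc(data$response[-idx], pred)
}

# Simulated stand-in for the real training data (assumption).
set.seed(1)
n <- 500
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$response <- rbinom(n, 1, plogis(0.8 * train$x1 - 0.5 * train$x2))

# 20 independent 70/30 splits, one AUC per split.
aucs <- replicate(20, holdout_auc(train))
mean(aucs)  # average held-out AUC
sd(aucs)    # spread caused by the random split alone
```

The spread in aucs gives a sense of whether a gap like 0.638 versus 0.616 is within ordinary split-to-split noise.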

Cross-validation is slightly different from just using a holdout set. Five-fold cross-validation, for example, involves the following steps:

  1. Randomly split the full dataset into five equal-sized parts.
  2. For i = 1 to 5, fit the model on all the data except the ith part.
  3. For each i, evaluate AUC on the ith part, which was held out from that fit.
  4. Average the five AUC results.

This process can be repeated multiple times, with a different random split each time, to estimate the mean and variance of the out-of-sample AUC.
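The procedure described above, including the repetition, could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: train is a data frame with a 0/1 column named response, glm is used in place of glmnet to keep the example self-contained, the data are simulated, and auc and kfold_auc are hypothetical helper names.

```r
# Rank-based AUC (Mann-Whitney form); labels are assumed to be coded 0/1.
auc <- function(actual, score) {
  r <- rank(score)  # average ranks, so ties are handled
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One run of K-fold cross-validation; returns the mean of the K fold AUCs.
kfold_auc <- function(data, K = 5) {
  # Randomly assign each row to one of K (nearly) equal-sized folds.
  fold <- sample(rep(seq_len(K), length.out = nrow(data)))
  fold_aucs <- sapply(seq_len(K), function(i) {
    held_out <- fold == i
    # Fit on everything except fold i, then score fold i.
    fit <- glm(response ~ ., data = data[!held_out, ], family = binomial)
    pred <- predict(fit, newdata = data[held_out, ], type = "response")
    auc(data$response[held_out], pred)
  })
  mean(fold_aucs)
}

# Simulated stand-in for the real training data (assumption).
set.seed(1)
n <- 500
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$response <- rbinom(n, 1, plogis(0.8 * train$x1 - 0.5 * train$x2))

# Repeat the whole 5-fold procedure 10 times with fresh random folds.
cv_runs <- replicate(10, kfold_auc(train, K = 5))
mean(cv_runs)  # mean out-of-sample AUC across repetitions
var(cv_runs)   # variance across repetitions
```

Because every observation is held out exactly once per run, the averaged AUC typically varies less between runs than a single 70/30 split does.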

The R package cvTools allows you to do this. For example:

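library(ROCR)     # prediction(), performance()
library(cvTools)  # cvFit()

# cvFit calls cost(observed, predicted), so the observed labels come first
calc_AUC <- function(act, pred) {
  u <- prediction(pred, act)
  performance(u, "auc")@y.values[[1]]
}

# m is a fitted model; train is the data frame it was fitted on
cvFit(m, data = train, y = train$response,
      cost = calc_AUC, predictArgs = list(type = "response"))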

will perform 5-fold cross-validation of the model m using AUC as the performance metric. cvFit also takes arguments K (number of cross-validation folds) and R (number of times to perform the cross-validation with different random splits).

See http://en.wikipedia.org/wiki/Cross-validation_(statistics) for more info on cross-validation.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow