Question

I am trying to learn how caret works by following Max Kuhn's Applied Predictive Modeling book, but I have not been able to understand how caret's confusionMatrix function works.

I trained a model on the training data set (training[, fullSet]), which has 8190 rows and 1073 columns, using glmnet as follows:

glmnGrid <- expand.grid(alpha = c(0, .1, .2, .4, .6, .8, 1),
                        lambda = seq(.01, .2, length = 40))

ctrl <- trainControl(method = "cv",
                     number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     index = list(TrainSet = pre2008),
                     savePredictions = TRUE)

glmnFit <- train(x = training[, fullSet],
                 y = training$Class,
                 method = "glmnet",
                 tuneGrid = glmnGrid,
                 preProc = c("center", "scale"),
                 metric = "ROC",
                 trControl = ctrl)

Then, I printed the confusion matrix from the fit:

glmnetCM <- confusionMatrix(glmnFit, norm = "none")

When I looked at the confusion matrix, I got the following result:

               Reference
Prediction     successful unsuccessful
  successful          507          208
  unsuccessful         63          779

But I don't understand why the confusion table only has 1557 observations (1557 = 507 + 208 + 63 + 779), because caret's confusionMatrix.train documentation says that "When train is used for tuning a model, it tracks the confusion matrix cell entries for the hold-out samples." Since the training data set has 8190 rows and I used 10-fold CV, I thought that the confusion matrix should be based on 819 data points (819 = 8190 / 10), which is not the case.

Clearly I don't fully understand how caret's trainControl or train works. Can somebody explain what I misunderstood?
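
In case it is relevant, here is how I can inspect the resampling scheme stored in the fit (a minimal sketch using the glmnFit object above):

glmnFit$control$method                 # resampling method requested
names(glmnFit$control$index)           # names of the resamples train() iterated over
sapply(glmnFit$control$index, length)  # rows used for fitting in each resample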

Thanks so much for your help.

Young-Jin Lee

Solution

The issue is in the control parameter. You are using method = "cv" and number = 10, but you are also specifying the exact resamples that will be used to fit the model (via the index argument). I assume that this is the grant data from the book. In Chapter 12 we describe the data splitting scheme, where the pre2008 vector indicates that 6,633 of the 8,190 samples will be used for training. That leaves 1,557 samples held out during model tuning:

> dim(training)
[1] 8190 1785
> length(pre2008)
[1] 6633
> 8190-6633
[1] 1557
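
As a quick check (a minimal sketch, assuming the glmnFit object from the question), the hold-out predictions saved via savePredictions = TRUE tell the same story. They cover every tuning combination, so keep only the rows for the winning parameters:

## Keep the saved hold-out predictions for the best alpha/lambda only
holdout <- merge(glmnFit$pred, glmnFit$bestTune)
nrow(holdout)                     # 1557: one prediction per held-out sample
table(holdout$pred, holdout$obs)  # reproduces the confusion matrix above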

The predictions on the non-pre2008 samples are what you are seeing in the table. If you are trying to reproduce what we have, page 312 has the correct syntax:

ctrl <- trainControl(method = "LGOCV",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     index = list(TrainSet = pre2008))

If you just want to do 10-fold CV, get rid of the index argument.
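
For example, a minimal sketch (the control object from the question with the index argument removed):

ctrl <- trainControl(method = "cv",
                     number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE)

With that control object, confusionMatrix(glmnFit, norm = "none") aggregates the hold-out predictions from all ten folds (roughly 819 samples each), so the cell counts sum to 8,190.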

tl;dr: The control function says 10-fold CV, but the index argument says a single hold-out set of 1,557 samples should be used.

Max

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow