cv.glm issue with missing factor levels in R

https://stackoverflow.com/questions/16950209

31-05-2022

Question

I am testing the performance of a logistic regression using the cv.glm cross-validation procedure from the boot package in R.

Some of my predictor variables are factors.

When I run it I get the following error message:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
factor color has new levels RED

I think I understand the problem: the regression model may be trained on a subset of observations in which certain levels of a factor variable are not present. If that model is later used to predict new observations that contain those unseen levels, it has no coefficient for them and does not know how to behave.
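The situation described above can be reproduced with a small sketch. The data frame `d`, its column names and the train/test split are made up for illustration and are not from my real data; the only point is that the level RED is absent from the training rows.

```r
# Hypothetical toy data: the level "RED" occurs only in the last row.
d <- data.frame(
  y     = rep(c(0, 1, 1, 0), 5),
  color = factor(c(rep(c("BLUE", "GREEN"), length.out = 19), "RED"))
)

train <- d[1:19, ]                 # contains no RED observation
test  <- d[20, , drop = FALSE]     # contains only the RED observation

# glm() drops unused factor levels, so the fitted model only knows BLUE and GREEN.
fit <- glm(y ~ color, family = binomial, data = train)

# Predicting on the held-out row fails; capture the message instead of stopping.
res <- tryCatch(
  predict(fit, newdata = test, type = "response"),
  error = function(e) conditionMessage(e)
)
print(res)   # an error message about a new level of `color`
```

This is what can happen inside a single fold of a cross-validation when a rare level ends up only in the held-out part.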

Since this looks to me like a fundamental CV problem, I am surprised that I could not find any mention of it in the package documentation.

I would greatly appreciate any pointers.

Solution

As I mentioned in my comment, here's the example straight from ?errorest in the ipred package:

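library(ipred)   # errorest(), control.errorest()
library(MASS)    # lda()
data("PimaIndiansDiabetes", package = "mlbench")

# wrapper so that predict() returns class labels only
mypredict.lda <- function(object, newdata)
  predict(object, newdata = newdata)$class

# cv of a fixed partition of the data
list.tindx <- list(1:100, 101:200, 201:300, 301:400, 401:500,
                   501:600, 601:700, 701:768)

errorest(diabetes ~ ., data = PimaIndiansDiabetes, model = lda,
         estimator = "cv", predict = mypredict.lda,
         est.para = control.errorest(list.tindx = list.tindx))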

So you can specify your own CV folds and make sure they are balanced enough that no factor level is missing from the data the model is trained on in any fold.
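One way to build such balanced folds is to assign fold numbers separately within each factor level. The following is only a sketch in base R under some assumptions: the data frame `d`, the factor `color`, the predictor `x` and the choice of `k <- 3` are all hypothetical, and the manual loop at the end stands in for cv.glm, which as far as I know only takes the number of folds K rather than a user-supplied partition.

```r
# Hypothetical toy data; RED is deliberately the rare level.
set.seed(42)
n <- 90
d <- data.frame(
  color = factor(rep(c("BLUE", "GREEN", "RED"), times = c(54, 27, 9))),
  x     = rnorm(n)
)
d$y <- rbinom(n, 1, plogis(0.5 * d$x))

k <- 3

# Assign fold ids within each level so every level is spread over the k folds.
fold_id <- integer(n)
for (lev in levels(d$color)) {
  idx <- which(d$color == lev)
  fold_id[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# List of test indices per fold, the same shape as list.tindx above.
list.tindx <- split(seq_len(n), fold_id)

# Check: every level is present in the training part of every fold.
ok <- vapply(list.tindx, function(test_idx) {
  all(levels(d$color) %in% as.character(d$color[-test_idx]))
}, logical(1))
stopifnot(all(ok))

# Manual k-fold CV of a logistic regression using these folds.
cv_err <- vapply(list.tindx, function(test_idx) {
  fit  <- glm(y ~ color + x, family = binomial, data = d[-test_idx, ])
  prob <- predict(fit, newdata = d[test_idx, ], type = "response")
  mean((prob > 0.5) != d$y[test_idx])   # misclassification rate in this fold
}, numeric(1))
mean(cv_err)
```

Note that a level with only one observation can never satisfy the check, because it is always missing from the training data of the fold that holds it out; such levels would have to be merged with another level or handled separately.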

Licensed under: CC-BY-SA with attribution
