R caret package rfe never finishes error task 1 failed - "replacement has length zero"

StackOverflow https://stackoverflow.com/questions/22129561

  •  19-10-2022

Question

I recently started looking into the caret package for a model I'm developing, and I'm using the latest version. As a first step, I decided to use it for feature selection. The data has about 760 features and 10k observations. I wrote a simple function based on the online training material. Unfortunately, I consistently get an error, so the process never finishes. The code that produces the error is below. This example uses a small subset of the features, although I started with the full set. I've also changed the subsets and the number of folds and repeats, to no avail. I know it will be hard to track down the issue without the data, so I have shared a small subset of it (as an R object, in the format used below). If you have trouble getting the file from there, try this link.

It always produces this error:

Error in { : task 1 failed - "replacement has length zero"

caretFeatureSelection <- function() {
  library(caret)
  library(mlbench)
  library(Hmisc)

  set.seed(10)

  lr.features = c("f2", "f271","f527","f528","f404", "f376", "f67",  "f670", "f281", "f333", "f13",  "f282", "f599",
                  "f597", "f68",  "f629", "f378", "f230", "f229", "f273", "f768", "f406", "f630", 
                  "f596", "f598", "f413", "f412", "f332", "f377", "f766", "f767", "f775", "f10", "f442")

  trainDF <- readRDS(file='trainDF.rds')
  trainDF <- trainDF[trainDF$loss>0,]
  trainDF$lossProb <- trainDF$loss/100
  y <- trainDF[,'lossProb']
  x <- trainDF[,names(trainDF) %in% lr.features]

  rm(trainDF)

  subsets <- c(1:5, 10, 15, 20, 25)
  ctrl <- rfeControl(functions = lrFuncs,
                   method = "repeatedcv",
                   repeats = 1,
                   number=5)

  lrProfile <- rfe(x, y,
                 sizes = subsets,
                 rfeControl = ctrl)

  lrProfile
}

Solution

So looking at the data, there are three reasons for the failure. First,

> str(x)
'data.frame':   100 obs. of  34 variables:
 $ f2  : Factor w/ 10 levels "1","2","3","4",..: 8 8 8 8 9 8 9 9 7 8 ...
<snip>

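rfe fits an lm model to these data and generates 39 coefficients even though the data frame x has 34 columns, because the factor f2 expands into several dummy-variable coefficients. The coefficient names then no longer line up with the column names of x, and rfe gets confused. Try using model.matrix to convert the factor to dummy variables before running rfe: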

x2 <- model.matrix(~., data = x)[,-1]  ## the -1 removes the intercept column

... but...

> table(x$f2)

 1  2  3  4  6  7  8  9 10 11 
 0  0  0  2  2  5 32 36 23  0 

so model.matrix will generate zero-variance predictors for the empty levels (which is an issue). You could make a new factor that excludes the empty levels, but keep in mind that any resampling on these data will turn some of the sparse factor levels (e.g. "4", "6") into zero-variance predictors within a resample.
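To illustrate the empty-level problem, here is a minimal base R sketch. The data frame toy and its columns are made up for the example (they only mimic the shape of x$f2), and droplevels is one possible way to rebuild the factor without its unused levels, not necessarily what was intended above.

```r
# Toy factor that mimics x$f2: ten declared levels, several never observed.
# (toy, f2 and z are made-up illustration data, not the asker's data.)
f2 <- factor(c("4", "6", "7", "8", "8", "9", "9", "10"),
             levels = c("1", "2", "3", "4", "6", "7", "8", "9", "10", "11"))
toy <- data.frame(f2 = f2, z = seq_along(f2))

# With the empty levels kept, model.matrix creates all-zero dummy columns.
mm_full <- model.matrix(~ ., data = toy)[, -1]
zero_var_full <- colnames(mm_full)[apply(mm_full, 2, var) == 0]

# droplevels() discards unused levels before the dummy variables are built.
toy_clean <- droplevels(toy)
mm_clean <- model.matrix(~ ., data = toy_clean)[, -1]
zero_var_clean <- colnames(mm_clean)[apply(mm_clean, 2, var) == 0]

print(zero_var_full)   # dummy columns for levels that never occur
print(zero_var_clean)  # nothing left to flag after dropping empty levels
```

Dropping the empty levels up front only removes the columns that are constant in the full data. Levels with two or three observations can still become constant inside an individual resample.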

Secondly, there is perfect correlation between some predictors:

> cor(x$f597, x$f599)
     [,1]
[1,]    1

This causes NA values for some of the model coefficients, which leads to missing variable importances and tanks rfe.
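The NA coefficients are easy to reproduce with a small base R sketch. The variables a, b, c2 and y_toy are simulated for illustration only. b is constructed as an exact linear function of a, standing in for a pair like f597 and f599.

```r
set.seed(1)
n <- 50
a  <- rnorm(n)
b  <- 2 * a + 3            # exact linear function of a, so cor(a, b) is 1
c2 <- rnorm(n)
y_toy <- 1 + a + c2 + rnorm(n)

# lm cannot separate a from b, so one of the two coefficients is set to NA.
fit <- lm(y_toy ~ a + b + c2)
coefs <- coef(fit)
aliased <- names(coefs)[is.na(coefs)]

print(cor(a, b))
print(aliased)
```

Any importance score derived from the coefficients inherits that NA, which is why the redundant predictor should be removed before ranking.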

Unless you are using trees or some other model that is tolerant to sparse and/or correlated predictors, a possible workflow prior to rfe could be:

> x2 <- model.matrix(~., data = x)[,-1]
> 
> nzv <- nearZeroVar(x2)
> x3 <- x2[, -nzv]
> 
> corr_mat <- cor(x3)
> too_high <- findCorrelation(corr_mat, cutoff = .9)
> x4 <- x3[, -too_high]
> 
> c(ncol(x2), ncol(x3), ncol(x4))
[1] 42 37 27
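One caveat when reusing this workflow on other data: as far as I know, nearZeroVar and findCorrelation return a vector of column indices, and that vector is empty when nothing is flagged. Negative indexing with an empty vector drops every column rather than none. The sketch below shows the pitfall in base R. drop_cols is a hypothetical helper written for this example, not a caret function.

```r
# Hypothetical helper (not part of caret): drop columns by index,
# but return the matrix unchanged when the index vector is empty.
drop_cols <- function(m, idx) {
  if (length(idx) > 0) m[, -idx, drop = FALSE] else m
}

m <- matrix(1:12, nrow = 3, dimnames = list(NULL, c("a", "b", "c", "d")))
none <- integer(0)         # what an empty filter result looks like

naive <- m[, -none, drop = FALSE]  # -integer(0) is still integer(0): selects no columns
safe  <- drop_cols(m, none)        # keeps all four columns

print(ncol(naive))
print(ncol(safe))
print(colnames(drop_cols(m, c(2L, 4L))))
```

With a helper like this, the filtering steps would read x3 <- drop_cols(x2, nzv) and x4 <- drop_cols(x3, too_high).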

Lastly, by the looks of y, you want to predict a number, but lrFuncs is for logistic regression, so I assume it was a typo for lmFuncs. If that is the case, rfe works fine:

> subsets <- c(1:5, 10, 15, 20, 25)
> ctrl <- rfeControl(functions = lmFuncs,
+                    method = "repeatedcv",
+                    repeats = 1,
+                    number=5)
> set.seed(1)
> lrProfile <- rfe(as.data.frame(x4), y,
+                  sizes = subsets,
+                  rfeControl = ctrl)

Max

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow