So, looking at the data, there are three reasons for the failure. First:
> str(x)
'data.frame': 100 obs. of 34 variables:
$ f2 : Factor w/ 10 levels "1","2","3","4",..: 8 8 8 8 9 8 9 9 7 8 ...
<snip>
`rfe` fits an `lm` model to these data, and that fit generates 39 coefficients even though the data frame `x` has only 34 columns (the factor `f2` is expanded into a set of dummy variables). As a result, `rfe` gets... confused. Try using `model.matrix` to convert the factor to dummy variables yourself before running `rfe`:
x2 <- model.matrix(~., data = x)[,-1] ## the -1 removes the intercept column
... but...
> table(x$f2)
1 2 3 4 6 7 8 9 10 11
0 0 0 2 2 5 32 36 23 0
so `model.matrix` will generate some zero-variance predictors (which is an issue). You could make a new factor that excludes the empty levels (a quick sketch is below), but keep in mind that any resampling of these data can still turn the sparse levels (e.g. "4", "6") into zero-variance predictors within a resample.
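For example, a minimal sketch using base R's `droplevels` (the resampling caveat still applies):

x$f2 <- droplevels(x$f2)  ## drops the empty levels "1", "2", "3", "11"
table(x$f2)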
Second, there is perfect correlation between some of the predictors:
> cor(x$f597, x$f599)
[,1]
[1,] 1
This will cause `NA` values for some of the model coefficients, lead to missing variable importances, and tank `rfe`.
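To see why, here is a minimal sketch with made-up data (the names `a`, `b`, and `y_demo` are just for illustration): `lm` drops one of a pair of perfectly correlated predictors from the fit, so its coefficient comes back `NA`:

set.seed(1)
a <- rnorm(20)
b <- a                    ## perfect copy of a, so cor(a, b) == 1
y_demo <- a + rnorm(20)   ## made-up response
coef(lm(y_demo ~ a + b))  ## the coefficient for b is NA (rank-deficient fit)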
Unless you are using trees or some other model that is tolerant of sparse and/or correlated predictors, a possible workflow prior to `rfe` could be:
> x2 <- model.matrix(~., data = x)[,-1]  ## dummy variables, intercept removed
>
> nzv <- nearZeroVar(x2)                 ## index the near-zero-variance columns
> x3 <- x2[, -nzv]
>
> corr_mat <- cor(x3)
> too_high <- findCorrelation(corr_mat, cutoff = .9)  ## columns to drop so no pair has |cor| > .9
> x4 <- x3[, -too_high]
>
> c(ncol(x2), ncol(x3), ncol(x4))
[1] 42 37 27
Lastly, by the looks of `y`, you want to predict a number, but `lrFuncs` is for logistic regression, so I assume it was a typo for `lmFuncs`. If that is the case, `rfe` works fine:
> subsets <- c(1:5, 10, 15, 20, 25)
> ctrl <- rfeControl(functions = lmFuncs,
+                    method = "repeatedcv",
+                    repeats = 1,
+                    number = 5)
> set.seed(1)
> lrProfile <- rfe(as.data.frame(x4), y,
+                  sizes = subsets,
+                  rfeControl = ctrl)
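Assuming that fit runs cleanly, you can then inspect the results, e.g.:

lrProfile              ## prints the resampled performance for each subset size
predictors(lrProfile)  ## the predictors in the best subset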
Max