If you want to fit these types of models with caret
, you would need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.
My approach below is to jointly fit the PLS and other model (I used random forest in the example below) and tune them at the same time. So for each fold, a 2D grid of ncomp
and mtry
is used.
The "trick" is to attached the PLS loadings to the random forest object so that they can be used during prediction time. Here is the code that defines the model (classification only):
modelInfo <- list(label = "PLS-RF",
library = c("pls", "randomForest"),
type = "Classification",
parameters = data.frame(parameter = c('ncomp', 'mtry'),
class = c("numeric", 'numeric'),
label = c('#Components',
'#Randomly Selected Predictors')),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
mtry = 1:len)
grid <- subset(grid, mtry <= ncomp)
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the random forest object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- randomForest(scores, y, mtry = param$mtry, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)
},
prob = NULL,
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
and here is the call to train
:
library(ChemometricsWithR)
data(prostate)
set.seed(1)
inTrain <- createDataPartition(prostate.type, p = .90)
trainX <-prostate[inTrain[[1]], ]
trainY <- prostate.type[inTrain[[1]]]
testX <-prostate[-inTrain[[1]], ]
testY <- prostate.type[-inTrain[[1]]]
## These will take a while for these data
set.seed(2)
plsrf <- train(trainX, trainY, method = modelInfo,
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "repeatedcv",
repeats = 5))
## How does random forest do on its own?
set.seed(2)
rfOnly <- train(trainX, trainY, method = "rf",
tuneLength = 10,
trControl = trainControl(method = "repeatedcv",
repeats = 5))
Just for kicks, I got:
> getTrainPerf(plsrf)
TrainAccuracy TrainKappa method
1 0.7940423 0.65879 custom
> getTrainPerf(rfOnly)
TrainAccuracy TrainKappa method
1 0.7794082 0.6205322 rf
and
> postResample(predict(plsrf, testX), testY)
Accuracy Kappa
0.7741935 0.6226087
> postResample(predict(rfOnly, testX), testY)
Accuracy Kappa
0.9032258 0.8353982
Max