How to custom a model in CARET to perform PLS-[Classifer] two-step classificaton model?

Question 1

If you want to fit these types of models with caret, you would need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.

My approach below is to jointly fit the PLS and other model (I used random forest in the example below) and tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.

The "trick" is to attached the PLS loadings to the random forest object so that they can be used during prediction time. Here is the code that defines the model (classification only):

 modelInfo <- list(label = "PLS-RF",
              library = c("pls", "randomForest"),
              type = "Classification",
              parameters = data.frame(parameter = c('ncomp', 'mtry'),
                                      class = c("numeric", 'numeric'),
                                      label = c('#Components', 
                                                '#Randomly Selected Predictors')),
              grid = function(x, y, len = NULL) {
                grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
                            mtry = 1:len)
                grid <- subset(grid, mtry <= ncomp)
                },
              loop = NULL,
              fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                     ## First fit the pls model, generate the training set scores,
                     ## then attach what is needed to the random forest object to 
                     ## be used later
                     pre <- plsda(x, y, ncomp = param$ncomp)
                     scores <- pls:::predict.mvr(pre, x, type = "scores")
                     mod <- randomForest(scores, y, mtry = param$mtry, ...)
                     mod$projection <- pre$projection
                     mod
                   },
                   predict = function(modelFit, newdata, submodels = NULL) {       
                     scores <- as.matrix(newdata)  %*% modelFit$projection
                     predict(modelFit, scores)
                   },
                   prob = NULL,
                   varImp = NULL,
                   predictors = function(x, ...) rownames(x$projection),
                   levels = function(x) x$obsLevels,
                   sort = function(x) x[order(x[,1]),])

and here is the call to train:

 library(ChemometricsWithR)
 data(prostate)

 set.seed(1)
 inTrain <- createDataPartition(prostate.type, p = .90)
 trainX <-prostate[inTrain[[1]], ]
 trainY <- prostate.type[inTrain[[1]]]
 testX <-prostate[-inTrain[[1]], ]
 testY <- prostate.type[-inTrain[[1]]]

 ## These will take a while for these data
 set.seed(2)
 plsrf <- train(trainX, trainY, method = modelInfo,
                preProc = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

 ## How does random forest do on its own?
 set.seed(2)
 rfOnly <- train(trainX, trainY, method = "rf",
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

Just for kicks, I got:

 > getTrainPerf(plsrf)
   TrainAccuracy TrainKappa method
 1     0.7940423    0.65879 custom
 > getTrainPerf(rfOnly)
   TrainAccuracy TrainKappa method
 1     0.7794082  0.6205322     rf

and

 > postResample(predict(plsrf, testX), testY)
  Accuracy     Kappa 
 0.7741935 0.6226087 
 > postResample(predict(rfOnly, testX), testY)
  Accuracy     Kappa 
 0.9032258 0.8353982

Max

Question 2

Based on Max's valuable comments, I felt the need to have IRIS referee, which is famous for classification, and more importantly the Species outcome has more than two classes, which would be a good data set to test the PLS-LDA custom model in caret:

data(iris)
names(iris)
head(iris)
dim(iris) # 150x5
set.seed(1)
inTrain <- createDataPartition(y = iris$Species,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Iris
## that belong in the training set.
training <- iris[ inTrain,] # 114
testing <- iris[-inTrain,] # 36

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE)
set.seed(2)
plsFitIris <- train(Species ~ .,
                   data = training,
                   method = "pls",
                   tuneLength = 4,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
plsFitIris
plot(plsFitIris)


set.seed(2)
plsldaFitIris <- train(Species ~ .,
                      data = training,
                      method = modelInfo,
                      tuneLength = 4,
                      trControl = ctrl,
                      preProc = c("center", "scale"))

plsldaFitIris
plot(plsldaFitIris)

Now comparing the two models:

getTrainPerf(plsFitIris)
  TrainAccuracy TrainKappa method
1     0.8574242  0.7852462    pls
getTrainPerf(plsldaFitIris)
  TrainAccuracy TrainKappa method
1      0.975303  0.9628179 custom
postResample(predict(plsFitIris, testing), testing$Species)
Accuracy    Kappa 
   0.750    0.625 
postResample(predict(plsldaFitIris, testing), testing$Species)
 Accuracy     Kappa 
0.9444444 0.9166667

So, finally there was the EXPECTED difference, and improvement in the metrics. So this would support Max's notion, that two-class problems because of Bayes' probabilistic approach of plsda function both lead to the same results.

Question 3

You need to wrap the CV around both PLS and LDA.
Yes, both plsr and lda center the data their own way
I had a closer look at caret::preProcess (): as it is defined now, you will not be able to use PLS as preprocessing method because it is supervised but caret::preProcess () uses unsupervised methods only (there is no way to hand over the dependent variable). This would probably make patching rather difficult.
So inside the caret framework, you'll need to go for a custom model.

Question 4

If the scenario were to custom a model of PLS-LDA type, according to the code kindly provided by Max (maintainer of CARET), something is not corect in this code, but I didn't figure it out, because I used the Sonar data set the same in caret vignette and tried to reproduce the result one time using method="pls" and another time using the below custom model for PLS-LDA, the results were exactly identical even to the last digit, which was nonsensical. For benchmarking, one need a known data set (I think a cross-validated PLS-LDA for iris data set would fit here as it is famous for this type of analysis and there should be somewhere a cross-validated treatment of it), everything should be the same (the set.seed(xxx) and the no of K-CV repitition) except the code in question so as to rightly compare and to judge the code below:

modelInfo <- list(label = "PLS-LDA",
                  library = c("pls", "MASS"),
                  type = "Classification",
                  parameters = data.frame(parameter = c("ncomp"),
                                          class = c("numeric"),
                                          label = c("#Components")),
                  grid = function(x, y, len = NULL) {
                    grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
                  },
                  loop = NULL,
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    ## First fit the pls model, generate the training set scores,
                    ## then attach what is needed to the lda object to 
                    ## be used later
                    pre <- plsda(x, y, ncomp = param$ncomp)
                    scores <- pls:::predict.mvr(pre, x, type = "scores")
                    mod <- lda(scores, y, ...)
                    mod$projection <- pre$projection
                    mod
                  },
                  predict = function(modelFit, newdata, submodels = NULL) {       
                    scores <- as.matrix(newdata)  %*% modelFit$projection
                    predict(modelFit, scores)$class
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {       
                    scores <- as.matrix(newdata)  %*% modelFit$projection
                    predict(modelFit, scores)$posterior
                  },
                  varImp = NULL,
                  predictors = function(x, ...) rownames(x$projection),
                  levels = function(x) x$obsLevels,
                  sort = function(x) x[order(x[,1]),])

Based on Zach's request, the code below is for method="pls" in caret, exactly the same concrete example in caret vigenette on CRAN:

library(mlbench) # data set from here
data(Sonar)
dim(Sonar) # 208x60
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
training <- Sonar[ inTrain,] #157
testing <- Sonar[-inTrain,] # 51

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(108)
plsFitSon <- train(Class ~ .,
                data = training,
                method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                metric = "ROC",
                preProc = c("center", "scale"))
plsFitSon
plot(plsFitSon) # might be slightly difference than what in the vignette due to radnomness

Now, the code below is a pilot run to classify Sonar data using the custom model PLS-LDA which is under question, it is expected to come up with any numbers apart from identical with those using PLS only:

set.seed(108)
plsldaFitSon <- train(Class ~ .,
                   data = training,
                   method = modelInfo,
                   tuneLength = 15,
                   trControl = ctrl,
                   metric = "ROC",
                   preProc = c("center", "scale"))

Now comparing the results between the two models:

getTrainPerf(plsFitSon)
   TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381    pls
getTrainPerf(plsldaFitSon)
   TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 custom

postResample(predict(plsFitSon, testing), testing$Class)
Accuracy    Kappa 
0.745098 0.491954 
postResample(predict(plsldaFitSon, testing), testing$Class)
Accuracy    Kappa 
0.745098 0.491954

So, the results are exactly the same which cannot be. As if the lda model were not added?