Custom parameter tuning for KNN in caret

https://stackoverflow.com/questions/19767528

03-07-2022

Question

I have a k-nearest-neighbors implementation that lets me compute, in a single pass, predictions for multiple values of k and for multiple subsets of training and test data (e.g. all the folds in K-fold cross-validation, a.k.a. resampling metrics). My implementation can also leverage multiple cores.
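The single-pass idea for multiple values of k can be sketched in base R as follows. This is a minimal illustration, not the implementation described above: it assumes Euclidean distance and simple majority voting, and `knn_multi_k` is a hypothetical name. Distances and neighbor ordering are computed once for the largest k and then reused for every smaller k.

```r
# Hypothetical helper (not part of caret): predictions for every k in `ks`
# from one distance computation. Assumes numeric features.
knn_multi_k <- function(train_x, train_y, test_x, ks) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  train_y <- factor(train_y)
  lev     <- levels(train_y)
  codes   <- as.integer(train_y)
  k_max   <- max(ks)
  stopifnot(k_max <= nrow(train_x))

  # Squared Euclidean distances (test rows x training rows), computed once.
  d2 <- outer(rowSums(test_x^2), rowSums(train_x^2), "+") -
    2 * tcrossprod(test_x, train_x)

  # Class codes of the k_max nearest neighbors, nearest first, one row per test point.
  nn_lab <- matrix(NA_integer_, nrow(test_x), k_max)
  for (i in seq_len(nrow(test_x))) {
    nn_lab[i, ] <- codes[order(d2[i, ])[seq_len(k_max)]]
  }

  # Majority vote over the first k columns; ties go to the first class level.
  preds <- lapply(ks, function(k) {
    votes <- apply(nn_lab[, seq_len(k), drop = FALSE], 1,
                   function(l) which.max(tabulate(l, nbins = length(lev))))
    factor(lev[votes], levels = lev)
  })
  names(preds) <- paste0("k", ks)
  preds
}

# Usage on random data shaped like the example further down.
set.seed(1)
x <- matrix(runif(3 * 200), ncol = 3)
y <- sample(c("one", "two", "three"), 200, replace = TRUE)
idx <- 1:150
preds <- knn_multi_k(x[idx, ], y[idx], x[-idx, ], ks = c(7, 9, 11))
str(preds)
```

The same neighbor table could be built per fold, which is what makes a single call per fold (rather than per fold and per k) sufficient.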

I would like to interface my method with the caret package. I can easily build a custom method for the train function, but this results in multiple calls to the model fit (one for each combination of parameter value and fold).

As far as I know, I can't specify tuning strategies when using trainControl. The source code of train mentions something about "seq" model fitting:

## There are two types of methods to build the models: "basic" means that each tuning parameter
## combination requires it's own model fit and "seq" where a single model fit can be used to
## get predictions for multiple tuning parameters.

But I can't see any way to actually use that with custom models.

Any clue on how to approach this?

More generally, suppose you have a model class where you can estimate prediction errors across multiple parameter values using a single model fit (e.g. à la the linear regression LOOCV trick, but for multiple parameter values too). How would you interface it with caret?
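As a concrete instance of such a model class, here is a minimal sketch that assumes ridge regression without an intercept (an illustrative choice, not something named in the question). One SVD of the design matrix yields the leave-one-out error for every penalty value, using the identity that the leave-one-out residual equals the ordinary residual divided by 1 - h_ii, where h_ii is the diagonal of the hat matrix. `ridge_loocv` is a hypothetical name.

```r
# Hypothetical helper: LOOCV mean squared error of ridge regression
# (no intercept) for each lambda, from a single SVD of X.
ridge_loocv <- function(X, y, lambdas) {
  s   <- svd(X)
  U   <- s$u
  d2  <- s$d^2
  Uty <- drop(crossprod(U, y))
  sapply(lambdas, function(lambda) {
    shrink <- d2 / (d2 + lambda)            # per-component shrinkage factors
    fitted <- drop(U %*% (shrink * Uty))    # in-sample fitted values
    h      <- drop(U^2 %*% shrink)          # diagonal of the hat matrix
    mean(((y - fitted) / (1 - h))^2)        # leave-one-out residuals, squared
  })
}

# Usage: compare the shortcut with brute-force refitting on simulated data.
set.seed(42)
X <- matrix(rnorm(40 * 3), 40, 3)
y <- drop(X %*% c(1, -2, 0.5)) + rnorm(40)
lambdas <- c(0.1, 1, 10)
fast <- ridge_loocv(X, y, lambdas)
brute <- sapply(lambdas, function(l) {
  mean(sapply(seq_len(nrow(X)), function(i) {
    Xi <- X[-i, , drop = FALSE]
    b  <- solve(crossprod(Xi) + l * diag(ncol(X)), crossprod(Xi, y[-i]))
    (y[i] - sum(X[i, ] * b))^2
  }))
})
print(all.equal(fast, brute))
```

The point of the sketch is the shape of the problem: one expensive computation, then a cheap loop over parameter values, with no refit per held-out observation.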

Here's some example code to set up an (empty) custom model in caret:

# Custom caret
library(caret)
learning_data = data.frame(y=sample(c("one","two","three"),200,replace=T))
learning_data = cbind(learning_data,matrix(runif(3*200),ncol=3))
trainRatio=0.75  # proportion of rows assigned to the training set
inTrain <- createDataPartition(learning_data$y, p = trainRatio, list = FALSE)
trainExpr <- learning_data[inTrain,]
testExpr <- learning_data[-inTrain,]

trainClass <- trainExpr$y
testClass <- testExpr$y

trainExpr$y<-NULL
testExpr$y<-NULL
cv_opts = trainControl(method="cv", number=4,verboseIter=T)

my_knn <- function(data,weight,parameter,levels,last,...){
        print("training")
        # print(dim(data))
        # str(parameter)
        # list(fit=rdist(data$,data))
        list(fit=NA)
}
my_knn_pred <- function(object,newdata){
    print("testing")
    # str(object)
    # print(dim(newdata))
    return("one")
}

sortFunc <- function(x)  x[order(x$k),]
# Values of K to test
knn_opts = data.frame(.k=c(seq(7,11, 2))) #odd to avoid ties
custom_tr = trainControl(method="cv", number=4,verboseIter=T,   custom=list(parameters=knn_opts,model=my_knn,prediction=my_knn_pred,probability=NULL,sort=sortFunc))

# With 4 folds and 3 values of k, this calls my_knn and my_knn_pred once per combination of fold and parameter value (12 combinations)
custom_knn_performances <- train(x = trainExpr, y = trainClass,method = "custom",trControl=custom_tr,tuneGrid=knn_opts)

I would like to control the training procedure so as to generate predictions for all folds and parameter values in a single call.

Answer

The custom model fitting parts of train currently don't allow for sequential parameters.

The next release will. All of the specific model code will no longer be hard-coded and will be modularized (including the sequential parameters).

The work is about 80% done and I hope to have it out before the end of the year. I want to do a lot of testing on this version.

Drop me an email if you would like to kick it around before it is released (no warranty though).

Max
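For readers on later caret releases, where train accepts a list of functions as `method`, the sequential-parameter idea can be sketched roughly as below. This follows my understanding of the custom-model interface in caret's "Using Your Own Model in train" documentation (elements such as `loop`, `fit`, and `predict` with a `submodels` argument), so check it against the installed version. The KNN logic is a plain base-R stand-in, and `multi_k_knn` is a hypothetical name.

```r
# Sketch of a custom model list. The `loop` element tells train() to fit only
# the largest k and to obtain the smaller k values as "submodels" of that fit.
multi_k_knn <- list(
  label = "KNN with all k from one fit (sketch)",
  library = NULL,
  type = "Classification",
  parameters = data.frame(parameter = "k", class = "numeric",
                          label = "#Neighbors"),
  grid = function(x, y, len = NULL, search = "grid") {
    data.frame(k = seq(5, by = 2, length.out = len))
  },
  loop = function(grid) {
    top <- which.max(grid$k)
    list(loop = grid[top, , drop = FALSE],
         submodels = list(grid[-top, , drop = FALSE]))
  },
  # KNN has no real training step: just store the data and the largest k.
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    list(x = as.matrix(x), y = factor(y), k = param$k)
  },
  predict = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
    newdata <- as.matrix(newdata)
    codes <- as.integer(modelFit$y)
    lev <- levels(modelFit$y)
    # Class codes of the nearest neighbors, nearest first, one row per new point.
    nn_lab <- matrix(NA_integer_, nrow(newdata), modelFit$k)
    for (i in seq_len(nrow(newdata))) {
      d <- colSums((t(modelFit$x) - newdata[i, ])^2)
      nn_lab[i, ] <- codes[order(d)[seq_len(modelFit$k)]]
    }
    # Majority vote over the first k neighbors; ties go to the first level.
    vote <- function(k) {
      lev[apply(nn_lab[, seq_len(k), drop = FALSE], 1,
                function(l) which.max(tabulate(l, nbins = length(lev))))]
    }
    out <- vote(modelFit$k)
    if (!is.null(submodels)) {
      # First element: the fitted k; then one element per submodel row.
      out <- c(list(out), lapply(submodels$k, vote))
    }
    out
  },
  prob = NULL,
  sort = function(x) x[order(x$k), , drop = FALSE]
)

# Intended use (requires caret; not run here):
# train(x = trainExpr, y = trainClass, method = multi_k_knn,
#       tuneGrid = data.frame(k = c(7, 9, 11)),
#       trControl = trainControl(method = "cv", number = 4))
```

With this arrangement the fit and predict functions should be called once per fold rather than once per fold and per k. It does not cover the other half of the question: all folds are still not handled in a single call.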

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow