Question

I'm using the R gbm package for boosting to do regression on some biological data of dimensions 10,000 x 932, and I want to know what the best parameter settings are for the gbm package, especially n.trees, shrinkage, interaction.depth, and n.minobsinnode. When I searched online I found that the caret package in R can find such parameter settings. However, I'm having difficulty using the caret package with the gbm package, so I just want to know how to use caret to find the optimal combination of the previously mentioned parameters. I know this might seem like a very typical question, but I've read the caret manual and still have difficulty integrating caret with gbm, especially since I'm very new to both of these packages.


Solution 2

This link has a concrete example (page 10) - http://www.jstatsoft.org/v28/i05/paper

Basically, one should first create a grid of candidate values for the hyperparameters (like n.trees, interaction.depth and shrinkage), then call the generic train function as usual.
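For illustration, here is a minimal sketch of that workflow on a toy regression problem; the simulated data frame, the small grid values, and the object names (simData, tuneGridSmall, cvControl, gbmFitSmall) are made up here and are not part of the original answer:

library(caret)
library(gbm)

# Toy stand-in for the real 10,000 x 932 data set (illustration only)
set.seed(1)
simData <- data.frame(matrix(rnorm(500 * 10), ncol = 10))
simData$y <- rowSums(simData[, 1:3]) + rnorm(500)

# Grid of candidate values for the tuning parameters caret exposes for gbm
tuneGridSmall <- expand.grid(n.trees = c(100, 300, 500),
                             interaction.depth = c(1, 3, 5),
                             shrinkage = c(0.01, 0.1),
                             n.minobsinnode = 10)

# 5-fold cross-validation; for a numeric outcome train() selects by RMSE
cvControl <- trainControl(method = "cv", number = 5)

gbmFitSmall <- train(y ~ ., data = simData,
                     method = "gbm",
                     trControl = cvControl,
                     tuneGrid = tuneGridSmall,
                     verbose = FALSE)

gbmFitSmall$bestTune   # best combination found by cross-validation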

OTHER TIPS

Not sure if you found what you were looking for, but I find some of these sheets less than helpful.

If you are using the caret package, the following command describes the required tuning parameters:

getModelInfo()$gbm$parameters

Here are some rules of thumb for running GBM:

  1. interaction.depth: the default is 1, and on most data sets that seems adequate, but on a few I have found that testing odd values up to the max gives better results. The max value I have seen for this parameter is floor(sqrt(NCOL(training))).
  2. shrinkage: the smaller the number, the better the predictive value, but the more trees required and the greater the computational cost. Testing values on a small subset of the data with something like shrinkage = seq(.0005, .05, .0005) can be helpful in defining the ideal value.
  3. n.minobsinnode: default is 10, and generally I don't mess with that. I have tried c(5,10,15,20) on small sets of data, and didn't really see an adequate return for computational cost.
  4. n.trees: the smaller the shrinkage, the more trees you should have. Start with something like n.trees = (1:50)*50 and adjust accordingly (the sequence starts at 50 rather than 0 because gbm needs at least one tree).

Example setup using the caret package:

getModelInfo()$gbm$parameters
library(parallel)
library(doMC)
registerDoMC(cores = 20)
# Max shrinkage for gbm
nl = nrow(training)
max(0.01, 0.1*min(1, nl/10000))
# Max Value for interaction.depth
floor(sqrt(NCOL(training)))
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
                       n.trees = (1:50)*50,
                       shrinkage = seq(.0005, .05, .0005),
                       n.minobsinnode = 10) # you can also try something like c(5, 10, 15, 20)

fitControl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           preProcOptions = list(thresh = 0.95),
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
                                distribution = "adaboost",
                                method = "gbm", bag.fraction = 0.5,
                                nTrain = round(nrow(training) * .75),
                                trControl = fitControl,
                                verbose = TRUE,
                                tuneGrid = gbmGrid,
                                ## Specify which metric to optimize
                                metric = "ROC"))

Things can change depending on your data (like the distribution), but I have found the key is to play with gbmGrid until you get the outcome you are looking for. The settings as they are now would take a long time to run, so modify them as your machine and time allow. To give you a ballpark for computation, I run on a 12-core Mac Pro with 64GB of RAM.
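Since the original question is about regression rather than classification, here is a rough sketch of how the same setup might be adapted for a continuous outcome; it is not from the original answer. It reuses the training data frame and Outcome column from the example above, while the reduced grid values and the object names fitControlReg, gbmGridReg, and GBMreg are placeholders:

# Resampling without class probabilities; RMSE is the default regression metric
fitControlReg <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 5)

# A deliberately smaller grid than the classification example above (placeholder values)
gbmGridReg <- expand.grid(interaction.depth = c(1, 3, 6),
                          n.trees = (1:30)*50,
                          shrinkage = c(0.001, 0.01, 0.05),
                          n.minobsinnode = 10)

set.seed(1)
GBMreg <- train(Outcome ~ ., data = training,
                method = "gbm",
                distribution = "gaussian",   # squared-error loss for regression
                bag.fraction = 0.5,
                trControl = fitControlReg,
                tuneGrid = gbmGridReg,
                verbose = FALSE,
                metric = "RMSE")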

Licensed under: CC-BY-SA with attribution