Not sure if you found what you were looking for, but I find some of these sheets less than helpful.
If you are using the caret package, the following call lists the required tuning parameters:
getModelInfo()$gbm$parameters
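For reference, that call returns a small table of the four tuning parameters caret exposes for method = "gbm"; roughly the following (the exact labels may vary a bit between caret versions, this is from memory rather than a fresh run):
#           parameter   class                   label
# 1           n.trees numeric # Boosting Iterations
# 2 interaction.depth numeric          Max Tree Depth
# 3         shrinkage numeric               Shrinkage
# 4    n.minobsinnode numeric Min. Terminal Node Size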
Here are some rules of thumb for running GBM:
- interaction.depth: the default is 1, and on most data sets that seems
adequate, but on a few I have found that testing odd values up to the
max (see the short sketch after this list) has given better results. The
max value I have seen for this parameter is floor(sqrt(NCOL(training))).
- shrinkage: the smaller the number, the better the predictive value,
the more trees required, and the higher the computational cost. Testing
values on a small subset of data with something like shrinkage =
seq(.0005, .05, .0005) can be helpful in pinning down the ideal value.
- n.minobsinnode: the default is 10, and generally I don't mess with it.
I have tried c(5, 10, 15, 20) on small sets of data and didn't really
see an adequate return for the computational cost.
- n.trees: the smaller the shrinkage, the more trees you should have.
Start with n.trees = (1:50)*50 and adjust accordingly.
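For example, generating the odd candidate depths up to that cap could look something like this (assuming your data frame is called training, as in the setup below):
# candidate tree depths: odd values from 1 up to floor(sqrt(NCOL(training)))
depth_max <- floor(sqrt(NCOL(training)))
depth_candidates <- seq(1, depth_max, by = 2)
depth_candidates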
Example setup using the caret package:
library(caret)
library(parallel)
library(doMC)
registerDoMC(cores = 20)
# The tuning parameters caret exposes for gbm
getModelInfo()$gbm$parameters
# Max shrinkage for gbm
nl <- nrow(training)
max(0.01, 0.1*min(1, nl/10000))
# Max value for interaction.depth
floor(sqrt(NCOL(training)))
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
                       n.trees = (1:50)*50,
                       shrinkage = seq(.0005, .05, .0005),
                       n.minobsinnode = 10) # you can also put something like c(5, 10, 15, 20)
fitControl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           preProcOptions = list(thresh = 0.95),
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
                                distribution = "adaboost",
                                method = "gbm", bag.fraction = 0.5,
                                nTrain = round(nrow(training) * .75),
                                trControl = fitControl,
                                verbose = TRUE,
                                tuneGrid = gbmGrid,
                                ## Specify which metric to optimize
                                metric = "ROC"))
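Once it finishes, the standard caret accessors are a quick way to see what the search found (GBM0604ada is just the fitted object from the call above):
# best combination of tuning parameters found
GBM0604ada$bestTune
# resampled ROC/Sens/Spec for every candidate in the grid
head(GBM0604ada$results)
# plot performance across the tuning grid
plot(GBM0604ada)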
Things can change depending on your data (like the distribution), but I have found the key is to play with gbmGrid until you get the outcome you are looking for. The settings as they are now would take a long time to run, so modify them as your machine and time allow.
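If you want a rough idea of what a trimmed-down first pass could look like, something along these lines works (the particular values here are just an illustration, not a recommendation):
# a much coarser grid for a quick first pass
gbmGridSmall <- expand.grid(interaction.depth = c(1, 3, 5),
                            n.trees = (1:10)*100,
                            shrinkage = c(0.001, 0.01, 0.05),
                            n.minobsinnode = 10)
nrow(gbmGridSmall) # 90 candidates instead of the 25,000 in gbmGrid above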
To give you a ballpark of the computation involved, I run this on a Mac Pro with 12 cores and 64 GB of RAM.