As far as I know, the sampsize argument should be a vector with one entry per class in your data set. If you pass a factor variable to the strata argument, then sampsize should instead have one entry per level of that factor. I am not sure it behaves as you describe in your question, but it has been a while since I have used the randomForest function.
The help file says:
strata: A (factor) variable that is used for stratified sampling.
sampsize: Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
For example, since your classification has 2 distinct classes, you need to give sampsize a vector of length 2 that specifies how many observations to sample from each class when growing each tree.
e.g. sampsize=c(100,50)
Furthermore, you can specify the names of the groups to be extra clear.
e.g. sampsize=c('0'=100, '1'=50)
An example from the help files that uses the sampsize argument, to clarify:
## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
data(iris)
(iris.rf2 <- randomForest(iris[1:4], iris$Species, sampsize=c(20, 30, 20)))
EDIT: Added some notes about the strata argument in randomForest.
EDIT: Make sure the strata argument is given a factor variable!
e.g. try strata = factor(bll_HH$HH_Pres), sampsize = c(...), where c(...) is a vector with length(levels(factor(bll_HH$HH_Pres))) elements, one per level.
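The length rule can be checked before fitting. Below is a minimal sketch in base R; check_sampsize is a hypothetical helper (it is not part of randomForest), the toy vector is made up, and it assumes the entries of sampsize are given in the same order as the factor levels.

```r
# Hypothetical helper, not part of randomForest: checks a stratified
# sampsize vector against the strata before the model is fitted.
check_sampsize <- function(strata, sampsize, replace = FALSE) {
  strata <- factor(strata)
  counts <- table(strata)
  if (length(sampsize) != nlevels(strata)) {
    stop("sampsize needs one entry per level of strata: ",
         paste(levels(strata), collapse = ", "))
  }
  # Without replacement, a stratum cannot supply more rows than it has.
  if (!replace && any(sampsize > counts)) {
    stop("without replacement, sampsize cannot exceed the class counts")
  }
  # Label the entries, assuming they were given in levels() order.
  setNames(as.numeric(sampsize), levels(strata))
}

# Toy stand-in for a 0/1 response such as HH_Pres (made-up values).
toy <- c(0, 0, 0, 0, 0, 0, 1, 1, 1)
check_sampsize(toy, c(4, 2))
```

The returned vector carries the level names, which makes it easy to see which count belongs to which class before passing it on as sampsize.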
EDIT:
OK, I tried running the code with your data, and it works for me.
# Fix up the data set to have HH_Pres and Region as factors
bll_HH$Region <- factor(bll_HH$Region)
bll_HH$HH_Pres <- factor(bll_HH$HH_Pres)
# Original RF code
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE,
                     importance = TRUE, na.action = na.omit)
HHrf
# Output
# OOB estimate of error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 425 15  0.03409091
# 1  86  8  0.91489362
# Take 63.2% from each class
mySampSize <- ceiling(table(bll_HH$HH_Pres) * 0.632)
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE,
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
# OOB estimate of error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 424 16  0.03636364
# 1  85  9  0.90425532
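The stratified sizes used in the run above can be worked out by hand. This sketch takes the class counts from the row totals of the confusion matrix (425 + 15 and 86 + 8), which assumes na.omit dropped no rows and every row was out-of-bag at least once; it compares them with the documented default sampsize for replace = FALSE, ceiling(.632 * nrow(x)).

```r
# Class counts read off the confusion-matrix row totals
# (an assumption: they may differ from table(bll_HH$HH_Pres) if rows had NAs).
class_counts <- c("0" = 440, "1" = 94)

# Stratified sizes, as in mySampSize: 63.2% of each class, rounded up.
per_class <- ceiling(class_counts * 0.632)

# Default (unstratified) size when replace = FALSE.
default_n <- ceiling(0.632 * sum(class_counts))

per_class                                   # 279 and 60
sum(per_class)                              # 339
default_n                                   # 338
round(per_class / sum(per_class), 3)        # 0.823 and 0.177
round(class_counts / sum(class_counts), 3)  # 0.824 and 0.176
```

Under these assumptions each tree sees almost the same number of rows, in almost the same class proportions, as with the default settings.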
Note that the OOB error estimate is the same in this case, even though each tree is now grown on only 63.2% of the rows from each class. This is probably because the sample sizes are proportional to the class distribution in your training data: with replace = FALSE the default sampsize is already ceiling(.632 * nrow(x)), so stratifying in proportion changes very little, especially on a data set this small. Let's try changing mySampSize to make sure it REALLY worked.
# Change mySampSize. Sample 100 from class 0 and 50 from class 1
mySampSize[1] <- 100
mySampSize[2] <- 50
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE,
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
# OOB estimate of error rate: 21.16%
# Confusion matrix:
#     0   1 class.error
# 0 382  58   0.1318182
# 1  55  39   0.5851064
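As a sanity check, the three OOB error rates can be recomputed from the confusion matrices printed above. This is only arithmetic on the reported numbers; oob_error and cm are hypothetical helper names, and no model is refitted.

```r
# Overall error = misclassified rows / all rows in the confusion matrix.
oob_error <- function(cm) (sum(cm) - sum(diag(cm))) / sum(cm)

# Build a 2 x 2 confusion matrix (rows = true class, columns = predicted).
cm <- function(v) matrix(v, nrow = 2, byrow = TRUE,
                         dimnames = list(c("0", "1"), c("0", "1")))

runs <- list(
  default      = cm(c(425, 15, 86, 8)),
  proportional = cm(c(424, 16, 85, 9)),
  c_100_50     = cm(c(382, 58, 55, 39))
)

rates <- round(100 * sapply(runs, oob_error), 2)
rates  # 18.91, 18.91, 21.16
```

The first two runs misclassify 101 rows each, just split differently between the classes, which is why their overall rates agree. With 100 and 50 the overall rate goes up, while the class 1 error drops from about 0.90 to 0.59.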