Stratified sampling doesn't seem to change randomForest results

Question 1

As far as I know, the sampsize argument should be a vector that is the same length as the number of classes in your data set. If you specify a factor variable in the strata argument, then sampsize should be given a vector that is the same length as the number of factors in the strata argument. I am not sure it performs as you describe in your question, but it has been a while since I have used the randomForest function.

From the help files, it says:

strata

A (factor) variable that is used for stratified sampling.

sampsize:

Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

For example, since your classification has 2 distinct classes, you need to give sampsize a vector of length 2 that specifies how many observations you want to sample from each class during training time.

e.g. sampsize=c(100,50)

Furthermore, you can specify the names of the groups to be extra clear.

e.g. sampsize=c('0'=100, '1'=50)

An example from the help files that uses the sampsize argument, to clarify:

## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
data(iris)
(iris.rf2 <- randomForest(iris[1:4], iris$Species, sampsize=c(20, 30, 20)))

EDIT: Added some notes about the strata argument in randomForest.

EDIT: Make sure the strata argument is given a factor variable!

e.g. try strata = factor(HH_Pres), sampsize = c(...) where c(...) is a vector that is the same length as length(levels(factor(bll_HH$HH_Pres)))

EDIT:

OK, I tried running the code with your data, and it works for me.

# Fix up the data set to have HH_Pres and Region as factors
bll_HH$Region <- factor(bll_HH$Region)
bll_HH$HH_Pres <- factor(bll_HH$HH_Pres)

# Original RF code
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                      Slope + MoonPhase + Chla + Region,
                    data=bll_HH, ntree = 500, replace = FALSE, 
                    importance = TRUE, na.action = na.omit)
HHrf

# Output
#         OOB estimate of  error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 425 15  0.03409091
# 1  86  8  0.91489362

# Take 63.2% from each class
mySampSize <- ceiling(table(bll_HH$HH_Pres) * 0.632)

set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE, 
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
#         OOB estimate of  error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 424 16  0.03636364
# 1  85  9  0.90425532

Note that the OOB error estimate is the same in this case, even if we only use 63.2% of the data from each of the classes in our bootstrap samples. This is probably due to using sample sizes that are proportional to the class distribution in your training data, and the relatively small size of your data set. Let's try changing mySampSize to make sure it REALLY worked.

# Change mySampSize. Sample 100 from class 0 and 50 from class 1
mySampSize[1] <- 100
mySampSize[2] <- 50

set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE, 
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
#         OOB estimate of  error rate: 21.16%
# Confusion matrix:
#     0  1 class.error
# 0 382 58   0.1318182
# 1  55 39   0.5851064

Question 2

This syntax seems to be working fine for me on your data. The OOB is 32.21% and the class error(s): 0.32, 0.29. I did kick up the number of Bootstraps to 1000. I always recommend using indexing to define a RF model. In certain circumstances, symbolic syntax seems to be unstable.

require(randomForest)
  HHrf <- read.csv("bll_HH.csv")
    set.seed(25)    
( rf.mdl <- randomForest( y=as.factor(HHrf[,"HH_Pres"]), x=HHrf[,2:ncol(HHrf)],
                          strata=as.factor(HHrf[,"HH_Pres"]), sampsize=c(50,50),
                          ntree=1000) )

Question 3

I ran into this problem too. What I noticed is that my error rate when using importance = TRUE changes significantly. It is not the same as if I did not choose stratification with sampling.

For me it ended up being a tradeoff in not having an importance/accuracy score for my classification tree. It appears to be one of many bugs in this implementation.

Stratified sampling doesn't seem to change randomForest results

Without stratification

With stratification