The key to using `ddply` well is to understand that it will:

- break the original data frame down by groups into smaller data frames,
- then, for each group, call the function you give it, whose job is to transform that small data frame into a new data frame,
- and finally, stitch all the little transformed data frames back together.
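As a minimal sketch of that split-apply-combine cycle (on a made-up toy data frame, not your data):

```r
library(plyr)

# A toy data frame with two groups, "a" and "b".
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 10))

# ddply splits `toy` by `g`, calls the anonymous function on each piece,
# and rbinds the little result data frames back together.
ddply(toy, .(g), function(piece) data.frame(mean_x = mean(piece$x)))
#   g mean_x
# 1 a    1.5
# 2 b   10.0
```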
With that in mind, here's an approach that (I think) solves your problem.
```r
sampleFunction <- function(df) {
    # Determine whether visited==1 or visited==0 is less common for this
    # location, and use that count as our sample size.
    n <- min(nrow(df[df$visited == "1", ]), nrow(df[df$visited == "0", ]))

    # Sample n rows from each of the two groups (visited==0 and visited==1).
    ddply(df, .(visited), function(x) x[sample(nrow(x), size = n), ])
}

newDF <- ddply(DF, .(location), sampleFunction)

# Just a quick check to make sure we have the equal counts we were looking for.
ddply(newDF, .(location, visited), summarise, N = length(variable))
```
How it works
The main `ddply` simply breaks `DF` down by location and applies `sampleFunction`, which does the heavy lifting. `sampleFunction` takes one of the smaller data frames (in your case, one for each location) and samples from it an equal number of `visited==1` and `visited==0` rows. How does it do this? With a second call to `ddply`: this time using `visited` to break the data down, so we can sample from both the 1's and the 0's.
Notice, too, that we're calculating the sample size for each location based on whichever sub-group (0 or 1) has fewer occurrences, so this solution will work even if there aren't always more 0's than 1's.
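To see that inner step in isolation, here is what `sampleFunction` does to a single location's rows (with hypothetical counts: say the location has three `visited==0` rows and one `visited==1` row):

```r
library(plyr)

# Hypothetical data for one location: three 0's, one 1.
df <- data.frame(visited = c("0", "0", "0", "1"), variable = 1:4)

# The rarer group has 1 row, so the sample size n is 1.
n <- min(nrow(df[df$visited == "1", ]), nrow(df[df$visited == "0", ]))
n
# [1] 1

# One row is kept from each visited group, giving equal counts.
set.seed(1)  # only to make the sample reproducible
ddply(df, .(visited), function(x) x[sample(nrow(x), size = n), ])
```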