Sample a subset of dataframe by group, with sample size equal to another subset of the dataframe

StackOverflow https://stackoverflow.com/questions/21946485

  •  14-10-2022

Question

Here's my hypothetical data frame:

location<- as.factor(rep(c("town1","town2","town3","town4","town5"),100))
visited<- as.factor(rbinom(500,1,.4)) #'Yes or No' variable
variable<- rnorm(500,10,2)
id<- 1:500
DF<- data.frame(id,location,visited,variable)

I want to create a new data frame where the number of 0's and 1's are equal for each location. I want to accomplish this by taking a random sample of the 0's for each location (since there are more 0's than 1's).

I found this solution to sample by group:

library(plyr)
ddply(DF[DF$visited=="0",],.(location),function(x) x[sample(nrow(x),size=5),])

I entered '5' for the size argument so the code would run, but I can't figure out how to set the 'size' argument equal to the number of observations where DF$visited==1.

I suspect the answer could be in other questions I've reviewed, but they've been a bit too advanced for me to implement.

Thanks for any help.


Solution

The key to using ddply well is to understand that it will:

  1. break the original data frame down by group into smaller data frames,
  2. call the function you give it on each group; that function's job is to transform its data frame into a new data frame,
  3. and finally stitch all the transformed data frames back together.
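
The three steps above can be sketched in base R, without plyr, as split, lapply and rbind. This is only an illustration of the idea, not a description of how ddply is implemented internally; the toy data frame and the helper name firstRow are made up for the example.

```r
# Toy data: two groups, three rows each (made up for illustration).
toy <- data.frame(group = rep(c("a", "b"), each = 3), value = 1:6)

# Hypothetical per-group function: keep only the first row of each piece.
firstRow <- function(piece) piece[1, , drop = FALSE]

pieces      <- split(toy, toy$group)        # 1. break down by group
transformed <- lapply(pieces, firstRow)     # 2. apply the function to each piece
result      <- do.call(rbind, transformed)  # 3. stitch the pieces back together
result
```

ddply wraps these three steps in one call, which is why the function you pass it only ever needs to deal with one group at a time.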

With that in mind, here's an approach that (I think) solves your problem.

library(plyr)

sampleFunction <- function(df) {
  # Count the rows in each visited group for this location and use the
  # smaller count as the sample size.
  n <- min(sum(df$visited == "1"), sum(df$visited == "0"))
  # Sample n rows, without replacement, from each visited group.
  ddply(df, .(visited), function(x) x[sample(nrow(x), size = n), ])
}

newDF <- ddply(DF,.(location),sampleFunction)

# Just a quick check to make sure we have the equal counts we were looking for.
ddply(newDF, .(location, visited), summarise, N=length(variable))

How it works

The main ddply simply breaks DF down by location and applies sampleFunction, which does the heavy lifting.

sampleFunction takes one of the smaller data frames (in your case, one for each location) and samples an equal number of visited==1 and visited==0 rows from it. How does it do this? With a second call to ddply, this time using visited to break it down, so we can sample from both the 1's and the 0's.

Notice, too, that we're calculating the sample size for each location based on whichever sub-group (0 or 1) has fewer occurrences, so this solution will work even if there aren't always more 0's than 1's.
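
To see which sample size each location ends up with, a cross-tabulation helps. Here is a small base R sketch, assuming DF is built as in the question; the set.seed call and the names counts and n_per_location are additions for this example only.

```r
set.seed(1)  # assumption: added only so the example is repeatable
location <- as.factor(rep(c("town1", "town2", "town3", "town4", "town5"), 100))
visited  <- as.factor(rbinom(500, 1, .4))
DF <- data.frame(id = 1:500, location, visited, variable = rnorm(500, 10, 2))

counts <- table(DF$location, DF$visited)  # rows: locations, columns: "0" and "1"
n_per_location <- apply(counts, 1, min)   # the smaller of the two counts per location
n_per_location
```

Each value in n_per_location is the n that sampleFunction would compute for that location, so newDF should contain that many 0 rows and that many 1 rows per town.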

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow