Question

I have a data frame with a column of bins and a column of values. The bins repeat within the data frame. I want to sample a predetermined number of values from each bin. That number comes from a reference data frame that holds the bin number in one column and the corresponding num.to.sample value in the second. The num.to.sample value for a bin should be passed to a sampling function as the number of values to draw from that bin.

    #Example data
    data = as.data.frame(cbind(rep(1:3, each=6)))
    colnames(data) = "bin"
    data$value = rnorm(18)

    #Reference file used to determine how many data$values to select based on data$bin
    ref = as.data.frame(cbind(1:3))
    colnames(ref) = "bin"
    ref$num.to.sample = c(1,2,3)

    #Sample function
    #num should be determined by the num.to.sample value that the bin matches to in ref
    samples = function(x, num){
      sample(x, num, replace=FALSE);
    }

    #this code below works for selecting a specific number of values by bin
    #how can this be turned into the num.to.sample value that would result from matching
    #data$bin to ref$bin and returning ref$num.to.sample?
    data.sample = data[unlist(tapply(1:nrow(data),data$bin, function(x) samples(x,2))),]
    data.sample

Any ideas?

Thanks!

Solution

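There are probably better ways, but as a first pass you could merge ref into data, so that every row carries its bin's num.to.sample, and then sample within each bin using plyr: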

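    # Join on the shared "bin" column so each row gets its num.to.sample
    data <- merge(data, ref)

    library(plyr)
    # Within each bin, draw num.to.sample rows without replacement
    ddply(data, "bin", function(x) x[sample(seq_len(nrow(x)), unique(x$num.to.sample)), ])
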
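
If you would rather stay in base R and keep the index-based approach from the question, the lookup can be done with match(). The following is only a sketch under a few assumptions: the names rows.by.bin, n.by.bin, pick and keep are made up for illustration, set.seed() is there only to make the run repeatable, and every bin in data is assumed to appear in ref.

```r
# Example data from the question (seed fixed only for reproducibility)
set.seed(1)
data <- data.frame(bin = rep(1:3, each = 6), value = rnorm(18))
ref <- data.frame(bin = 1:3, num.to.sample = c(1, 2, 3))

# Row indices of data grouped by bin; the list names are the bin labels
rows.by.bin <- split(seq_len(nrow(data)), data$bin)

# Look up num.to.sample for each bin by matching the labels against ref$bin
n.by.bin <- ref$num.to.sample[match(names(rows.by.bin), ref$bin)]

# Sample positions within each group rather than the indices themselves,
# so a group with a single row is not misread by sample() as 1:x
pick <- function(idx, n) idx[sample.int(length(idx), n)]
keep <- unlist(Map(pick, rows.by.bin, n.by.bin), use.names = FALSE)

data.sample <- data[keep, ]
table(data.sample$bin)
```

With the example data, table() should report one row for bin 1, two for bin 2 and three for bin 3. If a bin were missing from ref, match() would return NA and the sampling step would fail, so you may want to check for that first.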
Licensed under: CC-BY-SA with attribution