Random stratified sample from each factor level

https://stackoverflow.com//questions/20012911

21-12-2019

Question

I have a data set with two factors, Environments (4 levels) and Individuals (500 in each environment), plus a response variable YD. As part of my analysis I have to randomly sample 100 individuals from each environment in each of the following ways:

The same 100 individuals across all four environments (100 individuals)
A different 100 individuals from each environment (100*4 = 400 individuals)
The same 100 in 2 environments and another 100 in the other 2 environments (2*100 = 200 individuals)

I already solved this problem using several lines of code; however, I hope someone can help me write an R function that does it, which would be very useful in other situations.

Here is an example data set with a similar structure:

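  library(BLR)
  library(reshape2)   # melt() is not in base R; reshape2 is one package that provides it
  data(wheat)         # loads the matrix Y, among other objects
  Data <- melt(Y)     # long format: one row per individual/environment pair
  colnames(Data) <- c('Individuals','Environments','YD')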
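
If the BLR package is not installed, a toy data set with the same three columns is enough to try the sampling code. The sketch below is only a stand-in: the object name toy and the random YD values are made up for illustration and are not the wheat data.

```r
# Toy stand-in for the melted wheat data (hypothetical values, not the real data).
set.seed(1)                                # reproducible toy values
toy <- expand.grid(Individuals  = 1:500,   # 500 individuals ...
                   Environments = 1:4)     # ... each observed in 4 environments
toy$YD <- rnorm(nrow(toy))                 # made-up response variable
str(toy)                                   # 2000 rows, 3 columns
```

Anything written for Data should also run on toy, since only the column names and the "every individual in every environment" layout matter for the sampling.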

Solution

updated answer:

I just wrapped the answer in a function. Note that it is valid only for exactly 4 levels.

colnames(Data)<-c("Individuals","Environments","YD") #removed spaces from names

myfun <- function(DF, samplefrom, samplelevels, sampletype, samplesize)
{
 if(sampletype == "per1")
  {
   Env1 = sample(unique(DF[[samplefrom]]), samplesize)
   Env2 <- Env3 <- Env4 <- Env1
  }
 if(sampletype == "per4")
  {
   Env1 = sample(unique(DF[[samplefrom]]), samplesize)
   Env2 = sample(unique(DF[[samplefrom]])[!unique(DF[[samplefrom]]) %in% Env1], samplesize)
   Env3 = sample(unique(DF[[samplefrom]])[!unique(DF[[samplefrom]]) %in% c(Env1, Env2)], samplesize)
   Env4 = sample(unique(DF[[samplefrom]])[!unique(DF[[samplefrom]]) %in% c(Env1, Env2, Env3)], samplesize)
  }
 if(sampletype == "per2")
  {
   Env1 = sample(unique(DF[[samplefrom]]), samplesize)
   Env2 <- Env1
   Env3 = sample(unique(DF[[samplefrom]])[!unique(DF[[samplefrom]]) %in% Env1], samplesize)
   Env4 <- Env3
  } 

  # Pair each sample of individuals with one environment (sample() puts the
  # environments in random order), subset DF accordingly and stack the pieces.
  # Note: the subset must be taken from DF, not from the global Data.
  ret = do.call(rbind, mapply(function(ind, env) {df <- DF[DF[[samplelevels]] == env,]; 
                                                 df[df[[samplefrom]] %in% ind,]},
       env = as.list(sample(unique(DF[[samplelevels]]))), ind = list(Env1, Env2, Env3, Env4), 
            SIMPLIFY = FALSE))

  return(ret)
}
myfun(Data, "Individuals", "Environments", "per1", 2)
#     Individuals Environments         YD
#21         13954            1  0.6658681
#345       457982            1 -1.1022770
#620        13954            2 -0.4888968
#944       457982            2  0.6026167
#1219       13954            4 -0.7183965
#1543      457982            4  0.4881141
#1818       13954            5  0.2660623
#2142      457982            5 -2.0626073
myfun(Data, "Individuals", "Environments", "per2", 2)
#     Individuals Environments         YD
#25         15292            1 -1.1272386
#248       373045            1 -0.6659416
#624        15292            2 -0.2362053
#847       373045            2  0.5778210
#1260       62150            4  1.2077921
#1654     1541043            4  1.1406084
#1859       62150            5 -0.3358584
#2253     1541043            5  0.3897426
myfun(Data, "Individuals", "Environments", "per4", 2)
#     Individuals Environments         YD
#106        85786            1  1.4480500
#567      3830162            1 -1.8052577
#1029     1301802            2  0.2737786
#1043     1410845            2  1.0617118
#1630     1302304            4  0.6673241
#1678     1766332            4 -0.0451913
#1871       65315            5 -0.0597450
#2336     2621166            5  2.5590801

update 2: some comments

mapply applies a function to several arguments in parallel, element by element. Here the function takes two arguments, ind and env, and 1) subsets the data frame by env, then 2) subsets that result by ind. env is one environment and ind is one of the samples of individuals (Env1, ...) computed earlier in myfun. The arguments passed to mapply are env: [1, 2, 3, 4] (written here as 1 to 4; in the output above the wheat environments are labelled 1, 2, 4, 5) and ind: [Env1, Env2, Env3, Env4]. mapply takes env = 1 with ind = Env1, then env = 2 with ind = Env2, and so on, and returns the resulting subsets in a list. do.call(rbind, ...) then binds that list into a single data frame.

P.S. Because sample is applied to the environments, env can be [1, 2, 3, 4] or [2, 4, 3, 1] or any other order. So Env1 is not necessarily paired with environment 1; it may end up with environment 1, 2, 3 or 4, and the same holds for the other samples.
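
The pairing described above can be tried on a tiny example. This is a minimal sketch, not part of myfun: the names toy, picks, envs and pieces are hypothetical and the values are made up.

```r
# Toy data: 3 individuals in each of 2 environments (made-up values).
toy <- data.frame(Individuals  = rep(c("a", "b", "c"), times = 2),
                  Environments = rep(c("E1", "E2"), each = 3),
                  YD           = 1:6)

# One vector of selected individuals per environment.
picks <- list(c("a", "b"), c("c"))
envs  <- list("E1", "E2")

# mapply() walks both lists in parallel:
# (picks[[1]], envs[[1]]), then (picks[[2]], envs[[2]]).
pieces <- mapply(function(ind, env) {
  df <- toy[toy$Environments == env, ]   # 1) keep one environment
  df[df$Individuals %in% ind, ]          # 2) keep the selected individuals
}, ind = picks, env = envs, SIMPLIFY = FALSE)

result <- do.call(rbind, pieces)         # stack the list into one data frame
result
```

Replacing envs with as.list(sample(c("E1", "E2"))) would assign the picks to the environments in random order, which is what the sample call inside myfun does.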

update 3 and 4: function for a different number of levels

No_different_samples is the number of different samples you wish to take; it defaults to the number of levels of samplelevels (i.e. a different sample for every level). The function throws an error if No_different_samples does not divide the number of levels evenly: if you ask for 3 different samples from a population with 4 levels (as in your Data), it stops with an error, and you have to choose 1, 2 or 4 instead. grouping says how many levels each sample is assigned to, with one entry per sample.

myfun2 <- function(DF, samplefrom, samplelevels, 
               No_different_samples = NULL, grouping = NULL, samplesize)
{
 samp <- sample(unique(DF[[samplefrom]]))
 levs <- unique(DF[[samplelevels]])

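 if(is.null(No_different_samples)) No_different_samples <- length(levs)

 if(length(levs) %% No_different_samples) 
        stop("No_different_samples must divide the number of levels evenly")
 # default grouping: spread the samples evenly over the levels
 # (gives c(1, 1, 1, 1) for 4 samples and 4 levels)
 if(is.null(grouping)) 
        grouping <- rep(length(levs) / No_different_samples, No_different_samples)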
 if(length(samp) < No_different_samples * samplesize) 
        stop("can't take a sample this large from the population")

 ls_diffr_samps <- vector("list", length = No_different_samples)
 for(i in 1:No_different_samples)
  { 
   ls_diffr_samps[[i]] <- samp[(i * samplesize - (samplesize - 1)) : (i * samplesize)]
  }

 list_samples <- rep(ls_diffr_samps, times = grouping)  

 ret = do.call(rbind, mapply(function(ind, env) {df <- DF[DF[[samplelevels]] == env,]; 
                                                 df[df[[samplefrom]] %in% ind,]},
       env = as.list(sample(levs)), ind = list_samples, 
            SIMPLIFY = FALSE))   # FALSE rather than F, which can be reassigned

  return(ret)
}

myfun2(Data, "Individuals", "Environments", 1, 4, 2) #same sample for all
myfun2(Data, "Individuals", "Environments", 2, c(2, 2), 2) #same sample per 2
myfun2(Data, "Individuals", "Environments", 2, c(3, 1), 2) #same sample for 3    
myfun2(Data, "Individuals", "Environments", 4, c(1, 1, 1, 1), 2) #different sample for all
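
To see how No_different_samples and grouping interact, the slicing step can be reproduced on a plain vector. This is a minimal sketch under assumed names (ids, n_samples, chunks and per_level are hypothetical), and it uses split() where myfun2 uses a for loop, so it illustrates the idea and is not the function itself.

```r
set.seed(42)                   # only to make the illustration reproducible
ids        <- sample(1:20)     # shuffled pool of hypothetical individual IDs
samplesize <- 3
n_samples  <- 2                # plays the role of No_different_samples
grouping   <- c(3, 1)          # sample 1 goes to 3 levels, sample 2 to 1 level

# Cut the first n_samples * samplesize shuffled IDs into consecutive chunks.
# Because ids is a permutation, the chunks cannot share an individual.
used   <- ids[seq_len(n_samples * samplesize)]
chunks <- split(used, rep(seq_len(n_samples), each = samplesize))

# Repeat each chunk as often as grouping says, giving one entry per level.
per_level <- rep(chunks, times = grouping)
length(per_level)              # 4, one sample per level
```

The first three entries of per_level are the same sample and the fourth is a disjoint one, which corresponds to the "same sample for 3" call above.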

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow