Create column in dataframe that samples from another column by factor levels

https://stackoverflow.com/questions/23228089

07-07-2023
Question

I would like column x3 of my dataframe dat to contain a random sample of column x2, but the sample should be drawn only from rows that share the same factor level in column x1. I have looked at by(), ddply(), and sample(), but can't make them work. I also checked a similar question, but it didn't help. My attempts are shown below in what I hope is a reproducible example.

Here is the example dataframe:

dat <- data.frame(x1=c("a","a","a","b","b","b","c","c","c"),x2=1:9);
dat$x1 <- as.factor(dat$x1);
dat;
  x1 x2
1  a  1
2  a  2
3  a  3
4  b  4
5  b  5
6  b  6
7  c  7
8  c  8
9  c  9

Here are some of my non-working attempts to generate x3:

set.seed(99);
by(dat,FUN=dat$x1,dat$x3<-sample(dat$x1,1,replace=FALSE));  #this did not work at all

I also tried this

set.seed(99);
a <- by(dat,dat[,"x1"],function(d){sample(d$x2,3,replace=FALSE)},simplify=TRUE);
dat$x3<-a;
a;
dat[, "x1"]: a
[1] 2 1 3
--------------------------------------------------------------------------------------------------- 
dat[, "x1"]: b
[1] 6 5 4
--------------------------------------------------------------------------------------------------- 
dat[, "x1"]: c
[1] 9 7 8
dat;
  x1 x2      x3
1  a  1 2, 1, 3
2  a  2 6, 5, 4
3  a  3 9, 7, 8
4  b  4 2, 1, 3
5  b  5 6, 5, 4
6  b  6 9, 7, 8
7  c  7 2, 1, 3
8  c  8 6, 5, 4
9  c  9 9, 7, 8

This gets close: a holds the random resampling by factor level, but a is not a simple vector, so assigning it to dat$x3 recycles the three groups down the rows instead of lining each sample up with its own level. If a were a vector in row order, I could assign it directly to dat$x3. To sum up, I want x3 to hold, for each row, a value sampled from the x2 values that share that row's x1 level.

The solution should be efficient for a dataframe with more than 2 million rows. Thanks to anyone who can help. I hope to return the favor to others as I get better with R.

Solution

dat$x3 <- ave(dat$x2, dat$x1, FUN = sample)

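ave() applies sample() to the x2 values within each level of x1 and returns a plain vector aligned with the original rows, so it can be assigned straight to dat$x3. Because you constructed the output to have the same number of entries as there are rows in the dataframe, sample() is called with its default size, so what you get is a permutation of the x2 values within each distinct value of x1 (sampling without replacement). (Edited your code to make it run.)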
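One caveat worth knowing: when sample() is handed a single number n >= 1 it samples from 1:n rather than returning n, so a factor level with only one row can produce a surprising value. Below is a minimal sketch of a guard against that, using the example dataframe from the question. The helper name safe_sample and the extra column x4 are my own illustrations and are not part of the original answer.

```r
# Rebuild the example dataframe from the question.
dat <- data.frame(x1 = factor(rep(c("a", "b", "c"), each = 3)), x2 = 1:9)

# Hypothetical helper (an assumption, not from the original answer):
# sample positions instead of values, so a group containing one number
# is returned unchanged rather than being expanded to 1:n by sample().
safe_sample <- function(x, ...) x[sample.int(length(x), ...)]

set.seed(99)

# Permutation of x2 within each level of x1 (without replacement).
dat$x3 <- ave(dat$x2, dat$x1, FUN = safe_sample)

# Variant: draw with replacement within each level of x1.
dat$x4 <- ave(dat$x2, dat$x1, FUN = function(x) safe_sample(x, replace = TRUE))

print(dat)
```

If every level is guaranteed to have at least two rows, plain FUN = sample behaves the same way for the permutation case and the helper is unnecessary.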

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow