Question
Suppose I have a data frame, myD, with the following columns: x, y, a, b.
I want to select unique combinations of x and y. That part is easy: just use unique on the first two columns. However, each unique combination of x,y has multiple values of a and b, and I want to select a random row. That is, among all of the rows that match a particular combination of x,y, I want to randomly select just one. Note that I don't want to sample a and b independently; they should come from the same row.
I was using ddply to do this:
library(plyr)
ddply(myD, c("x","y"), summarize,
      a=a[1],
      b=b[1])
This, of course, gets the first pair of a,b for each combination of x,y, so I randomly permuted the entire data frame beforehand to make the selection uniform.
Anyway, this ddply command is extremely slow when the data frame has a million rows or more. Is there a faster way to do this?
Solution 3
I figured out a fast and simple solution.
First, randomly permute the rows:
myD <- myD[sample(nrow(myD)), ]
Next, keep only the first row for each unique combination of x and y:
myD <- myD[!duplicated(myD[,c("x","y")]),]
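As a quick sanity check, the two steps can be run on a small made-up data frame. The toy data and the helper name `pick_random_row` are assumptions for illustration and are not part of the original answer; treat this as a minimal sketch rather than a tuned implementation.

```r
# Hypothetical helper: keep one randomly chosen row per key combination.
pick_random_row <- function(d, keys = c("x", "y")) {
  # Shuffle the rows; sample.int(n) returns a random permutation of 1..n
  shuffled <- d[sample.int(nrow(d)), , drop = FALSE]
  # duplicated() flags every row after the first for each key combination,
  # so negating it keeps the first (now random) row of each group
  shuffled[!duplicated(shuffled[, keys, drop = FALSE]), , drop = FALSE]
}

# Toy data (made up): b is always 10 * a, so a broken pairing is easy to spot
set.seed(1)
toy <- data.frame(
  x = c("s", "s", "s", "t", "t", "t"),
  y = c("u", "u", "v", "v", "w", "w"),
  a = 1:6,
  b = (1:6) * 10
)

res <- pick_random_row(toy)
```

Because a and b are taken from the same shuffled row, they stay paired, which is the requirement stated in the question.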
OTHER TIPS
I have not built data to test this on, but I have found dplyr to be faster than plyr, so this command:
library(dplyr)
# %>% replaces the %.% operator, which newer dplyr releases no longer provide
df_sampled <- myD %>%
  group_by(x, y) %>%
  summarize(a = a[1], b = b[1])
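ought to give you better performance. Like the ddply call, it takes the first row of each group, so the rows still need to be permuted beforehand for the choice to be random.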
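If a random row per group is wanted directly, without permuting first, one possible variant is to sample inside each group. The sketch below assumes dplyr 1.0.0 or later, where `slice_sample()` is available; it is not what the answer above proposed, and the toy data is made up for illustration.

```r
library(dplyr)

# Toy data (made up): b is always 10 * a, so a broken pairing is easy to spot
set.seed(1)
toy <- data.frame(
  x = c("s", "s", "s", "t", "t", "t"),
  y = c("u", "u", "v", "v", "w", "w"),
  a = 1:6,
  b = (1:6) * 10
)

res_dplyr <- toy %>%
  group_by(x, y) %>%       # one group per (x, y) combination
  slice_sample(n = 1) %>%  # draw one whole row at random from each group
  ungroup()
```

Since whole rows are drawn, a and b always come from the same original row.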
Since speed is important here, I would suggest a combination of the data.table package and the sample function. data.table can do many of the same things plyr can do, but much faster. Something like this might work:
#Make fake data
set.seed(3)
myD <- data.frame(x=c("s","s","s","t","t","t"),y=c("u","u","v","v","w","w"),
a=rnorm(6),b=rnorm(6))
#See data
myD
#   x y           a           b
# 1 s u -0.96193342  0.08541773
# 2 s u -0.29252572  1.11661021
# 3 s v  0.25878822 -1.21885742
# 4 t v -1.15213189  1.26736872
# 5 t w  0.19578283 -0.74478160
# 6 t w  0.03012394 -1.13121857
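library(data.table)
myD <- data.table(myD)
# Draw one random row index per (x, y) group
myD[, rand.row := sample(.N, 1), by = c("x", "y")]
# Keep only the row at that index within each group
myD <- myD[, list(a = a[rand.row], b = b[rand.row]), by = c("x", "y", "rand.row")]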
myD
#    x y rand.row           a           b
# 1: s u        1 -0.96193342  0.08541773
# 2: s v        1  0.25878822 -1.21885742
# 3: t v        1 -1.15213189  1.26736872
# 4: t w        2  0.03012394 -1.13121857
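A more compact way to express the same idea in data.table is to pick one row number per group with `.I` and then subset once. This is a sketch of a common data.table idiom, not the code from the answer above; the toy data and the names `idx` and `res_dt` are assumptions for illustration.

```r
library(data.table)

# Toy data (made up): b is always 10 * a, so a broken pairing is easy to spot
set.seed(1)
toy <- data.table(
  x = c("s", "s", "s", "t", "t", "t"),
  y = c("u", "u", "v", "v", "w", "w"),
  a = 1:6,
  b = (1:6) * 10
)

# .I holds the row numbers (in the full table) of the current group and
# .N is the group size, so this picks one random row number per (x, y)
idx <- toy[, .I[sample(.N, 1)], by = .(x, y)]$V1

# Subset the table once with the selected row numbers
res_dt <- toy[idx]
```

This keeps every column of the selected row without listing a and b by name, and it avoids adding a helper column such as rand.row.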