Select unique combinations of some columns in R, with a random value for another column

StackOverflow https://stackoverflow.com/questions/23588826

  •  19-07-2023

Question

Suppose I have a data frame, myD, with the following columns: x, y, a, b.

I want to select unique combinations of x and y. That part is easy: just use unique on the first two columns. However, each unique combination of x,y has multiple values of a and b, and I want to select one random row per combination. That is, among all of the rows that match a particular combination of x,y, I want to randomly select exactly one. Note that I don't want to sample a and b independently; they should come from the same row.

I was using ddply to do this:

ddply(myD, c("x","y"), summarize,
        a=a[1],
        b=b[1])

This of course gets the first pair of a,b for each combination of x,y, so I was randomly permuting the entire data frame beforehand to make the first row of each group a uniformly random one.

Anyway, this ddply command is extremely slow when the data frame has a million rows or more. Is there a faster way to do this?


Solution 3

I figured out a fast and simple solution.

First, randomly permute the rows:

myD <- myD[sample(nrow(myD)), ]

Next, keep only the first row for each unique combination of x and y:

myD <- myD[!duplicated(myD[,c("x","y")]),]
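The two steps above can be tried on a small data frame. This is only a sketch: the toy data and the names shuffled and picked are made up for illustration, and b is deliberately set to a + 10 so it is easy to see that a and b still come from the same row.

```r
# Toy data with 4 distinct (x, y) combinations; b is always a + 10
set.seed(42)
myD <- data.frame(
  x = c("s", "s", "s", "t", "t", "t"),
  y = c("u", "u", "v", "v", "w", "w"),
  a = 1:6,
  b = 11:16
)

# Step 1: randomly permute the rows
shuffled <- myD[sample(nrow(myD)), ]

# Step 2: keep the first row seen for each (x, y) combination
picked <- shuffled[!duplicated(shuffled[, c("x", "y")]), ]

# Optional: restore a readable order
picked <- picked[order(picked$x, picked$y), ]
picked
```

Because duplicated() works on the whole table at once rather than group by group, this avoids the per-group overhead of ddply.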

OTHER TIPS

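I have not built data to test this on, but I have found dplyr to be faster than plyr, so the command below ought to give you better performance. Like the ddply version, it keeps the first row of each group, so shuffle the rows beforehand. (Older dplyr releases used %.% as the chaining operator; it has since been replaced by %>%.)

library(dplyr)

df_sampled <- myD %>%
  group_by(x, y) %>%
  summarize(a = a[1], b = b[1])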
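If a recent dplyr (1.0.0 or later) is available, the shuffle step can probably be skipped by sampling one row per group directly with slice_sample(). This is a sketch rather than a benchmarked answer; the toy data is made up, with b set to a + 10 so that it is easy to check that both values come from the same row.

```r
library(dplyr)

set.seed(1)
myD <- data.frame(
  x = c("s", "s", "s", "t", "t", "t"),
  y = c("u", "u", "v", "v", "w", "w"),
  a = 1:6,
  b = 11:16
)

df_sampled <- myD %>%
  group_by(x, y) %>%
  slice_sample(n = 1) %>%  # one whole random row per (x, y) group
  ungroup()

df_sampled
```

Since slice_sample() returns complete rows, a and b are never sampled independently.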

Since speed is important here, I would suggest combining the data.table package with the sample function. data.table can do many of the same things plyr can do, but usually much faster. Something like this might work:

#Make fake data
set.seed(3)
myD <- data.frame(x=c("s","s","s","t","t","t"),y=c("u","u","v","v","w","w"),
    a=rnorm(6),b=rnorm(6))

#See data
myD
#   x y           a           b
# 1 s u -0.96193342  0.08541773
# 2 s u -0.29252572  1.11661021
# 3 s v  0.25878822 -1.21885742
# 4 t v -1.15213189  1.26736872
# 5 t w  0.19578283 -0.74478160
# 6 t w  0.03012394 -1.13121857

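library(data.table)

myD <- data.table(myD)
# For each (x, y) group, draw one random row position within that group
myD[,rand.row:=sample(1:.N,1),by=c("x","y")]
# Keep the a and b found at that position, so both come from the same row
myD <- myD[,list(a=a[rand.row],b=b[rand.row]),by=c("x","y","rand.row")]
myD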

#    x y rand.row           a           b
# 1: s u        1 -0.96193342  0.08541773
# 2: s v        1  0.25878822 -1.21885742
# 3: t v        1 -1.15213189  1.26736872
# 4: t w        2  0.03012394 -1.13121857
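
A more compact variant of the same idea avoids the helper column by sampling row indices with data.table's .I symbol. This is a sketch under the assumption that the data fits in memory; the names dt, idx and picked are illustrative only.

```r
library(data.table)

set.seed(3)
dt <- data.table(
  x = c("s", "s", "s", "t", "t", "t"),
  y = c("u", "u", "v", "v", "w", "w"),
  a = rnorm(6),
  b = rnorm(6)
)

# .I holds each row's index in the full table; pick one index per (x, y) group
idx <- dt[, .I[sample(.N, 1)], by = .(x, y)]$V1

# Subset once by row number, so a and b come from the same original row
picked <- dt[idx]
picked
```

Subsetting by index once keeps every column of the chosen rows without having to list them individually in j.
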
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow