Question

I have two vectors, "H" and "L", each with 200 numeric values. I want to create a third vector called "HL" that contains 200 values sampled at random from H and L. However, I would like them to be selected in parallel (element by element), the same way the pmin and pmax functions work.

Simplified example:

H <- 1:5
L <- 6:10

# rbind(H,L)
#   [,1] [,2] [,3] [,4] [,5]
# H    1    2    3    4    5
# L    6    7    8    9   10
# intended result is then a random pick from each 'column' shown above, e.g:

HL <- c(6,2,8,4,10)

Is there a way of doing this without using a loop?

Any advice would be much appreciated. Thanks.


Solution

You simply need N samples from a Bernoulli (i.e., 0 or 1) distribution, where N is the number of values in H/L. You then use those samples to pick from H or L respectively. Using ifelse ensures the "parallel selection" you require.

set.seed(1)
N <- length(H)
HorL <- rbinom(N, 1, 0.5)

# select from H where the draw is 1, otherwise from L
results <- ifelse(HorL, H, L)

results
# [1]  6  7  3  4 10

This all wraps up as a nice one-liner:

# when its first argument has length > 1, rbinom() draws that many values
ifelse( rbinom(H, 1, 0.5), H, L)

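From @Arun: a (relatively) faster way of implementing this, which removes the need for ifelse, would be:

# positions where the draw is 0, i.e. where L should be used
idx <- which(!as.logical(rbinom(H, 1, 0.5)))
vv <- H              # start from H
vv[idx] <- L[idx]    # overwrite the chosen positions with L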

EXPLANATION

@Gabriel, the idea is that you are selecting from one of two options. You can effectively flip a coin: if heads, select from H; if tails, select from L. A single coin flip follows a Bernoulli distribution, which is the special case of the Binomial distribution with one trial. R can generate random numbers of exactly this kind with rbinom (using size = 1).

Thus we ask R for N of these, then select from H or L accordingly.

The "select from ..." part is the R trickery.

  • Notice that we can think of 1, 0 as TRUE, FALSE (or as A, B, etc.).

  • Using the ifelse approach should be fairly self-explanatory: where the test is TRUE, select from one source; where it is FALSE, select from the other.
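A minimal sketch of that coercion at work, assuming the H and L from the simplified example in the question. The names coin, picked and picked_explicit are illustrative only and do not appear in the original answer.

```r
H <- 1:5
L <- 6:10

set.seed(1)
coin <- rbinom(length(H), 1, 0.5)   # vector of 0s and 1s

# ifelse() coerces the numeric test to logical: 1 -> TRUE (take H), 0 -> FALSE (take L)
picked <- ifelse(coin, H, L)

# The same selection with an explicit comparison, which some readers find clearer
picked_explicit <- ifelse(coin == 1, H, L)
identical(picked, picked_explicit)

# Every element comes from the same position in either H or L
all(picked == H | picked == L)
```

Both checks should evaluate to TRUE; the second one is the "parallel selection" property stated as a test.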

Arun's approach is more creative. It uses the same "flip a coin" mechanism for choosing between the two vectors, but has the benefit of speed. (The difference is tiny for vectors of this size, but still.) His approach essentially says:

  • Start with one group, say H.
  • Flip a coin.
  • Whenever the coin is Tails, replace that element of H with the same-indexed element of L. (Notice that the "same index" aspect is what you are referring to as "parallel selection".)
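The three steps above can be sketched as a small function. This is only an illustration: pick_parallel is a hypothetical helper name, and it uses a logical mask where Arun's code uses which(), which should select the same positions.

```r
# pick_parallel is a hypothetical helper name, not part of the original answers
pick_parallel <- function(H, L) {
  stopifnot(length(H) == length(L))
  out <- H                                   # start with one group, H
  tails <- rbinom(length(H), 1, 0.5) == 0    # flip one coin per position
  out[tails] <- L[tails]                     # on tails, take the same-indexed element of L
  out
}

set.seed(1)
HL <- pick_parallel(1:5, 6:10)
HL
```

Because only the positions marked as tails are overwritten, each element of the result comes from the same index in H or in L.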

OTHER TIPS

library(data.table)
set.seed(1350)

# Create an example data table:
dt <- data.table(ID=1:200,H=sample(1:1000,200),L=sample(1001:2000,200),key="ID")
# (If you already have a data frame 'df', you can use):
# dt <- as.data.table(df)

set.seed(5655)
# Add a column that randomly samples between H and L:
dt[,HL:=sample(c(H,L),1),by=ID]
dt

#       ID   H    L   HL
#  1:   1 837 1391 1391
#  2:   2 999 1573 1573
#  3:   3 566 1275  566
#  4:   4 347 1709 1709
#  5:   5 129 1627  129
# ---                  
#196: 196  67 1879 1879
#197: 197 652 1811 1811
#198: 198 569 1160 1160
#199: 199  17 1026   17
#200: 200 221 1500 1500

Edit 2: My initial answer would give incorrect values if H had duplicates, as pointed out in the comments. I had added timings that showed data.table was faster, but once the answer is corrected it turns out to be much slower, as suggested in the comments. (It was faster with the wrong answer because it was grouping by duplicate values, so it had far fewer groups to process.)

In short, I was wrong, and you might be better off with the other answer.
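If you would still like the result as a data.table column, one possible sketch is to apply the vectorised coin flip from the other answer inside j, without any grouping. This is an assumption about how the two answers could be combined, not something either answer proposed, and it has not been benchmarked here.

```r
library(data.table)

set.seed(1350)
dt <- data.table(ID = 1:200,
                 H  = sample(1:1000, 200),
                 L  = sample(1001:2000, 200))

# One coin flip per row, no grouping: .N is the number of rows in dt
dt[, HL := ifelse(rbinom(.N, 1, 0.5) == 1, H, L)]

# Sanity check: each HL value equals the H or L value in its own row
dt[, all(HL == H | HL == L)]
```

Since there is no by clause, duplicates in H or L do not matter: the selection depends only on row position.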


Here is a proper benchmark:

library(data.table)
library(microbenchmark)

set.seed(1350) 

H <- sample(1:200, 200) 
L <- sample(201:400, 200)

usingDataTable <- quote({
  dt <- data.table(H, L)
  # grouping by H is safe here only because H has no duplicates
  dt[,HL:=sample(c(H,L),1),by=H]
})


dt2 <- data.table(H, L)
usingDataTable.NoInitialize <- quote({
  dt2[,HL:=sample(c(H,L),1),by=H]
})

usingVectors <- quote ({
  ifelse( rbinom(H, 1, 0.5), H, L)
})



microbenchmark(eval(usingVectors), eval(usingDataTable), eval(usingDataTable.NoInitialize), times=100L)

Unit: microseconds
                              expr      min       lq   median        uq      max neval
                eval(usingVectors)   55.021   61.148   66.760   69.4605 1682.163   100
              eval(usingDataTable) 1635.676 1745.437 1795.245 1851.0950 3629.179   100
 eval(usingDataTable.NoInitialize) 1458.573 1537.618 1596.237 1669.3750 3683.756   100
Licensed under: CC-BY-SA with attribution