Question

I have two vectors, e and g. For each element in e, I want to know the percentage of elements in g that are smaller. One way to implement this in R is:

set.seed(21)
e <- rnorm(1e4)
g <- rnorm(1e4)
mf <- function(p,v) {100*length(which(v<=p))/length(v)}  # % of v that is <= p
mf.out <- sapply(X=e, FUN=mf, v=g)

With large e or g, this takes a long time to run. How can I change or adapt this code to make it run faster?

Note: The mf function above is based on code from the mess function in the dismo package.


Solution

The reason this is so slow is that you're calling your function length(e) times. The overhead of an R function call doesn't make a large difference for small vectors, but it really starts to add up as the vectors grow.
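
To get a feel for that overhead, here is a minimal sketch (timings are illustrative and machine-dependent) comparing one R function call per element with a single vectorized operation:

# Per-element R calls vs. one vectorized call over the same data
x <- rnorm(1e6)
system.time(sapply(x, function(p) p + 1))  # one R call per element
system.time(x + 1)                         # a single vectorized call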

Normally, you would need to move this to compiled code, but luckily you can use findInterval: given e and the sorted g, it does a binary search and returns, for each element of e, the number of elements of g that are less than or equal to it.

set.seed(21)
e <- rnorm(1e4)
g <- rnorm(1e4)
O <- findInterval(e,sort(g))/length(g)  # proportion of g <= each element of e; multiply by 100 for a percentage

# Now for some timings:
f <- function(p,v) mean(v<=p)  # same as mf, but on a 0-1 scale
system.time(o <- sapply(e, f, g))
#   user  system elapsed 
#   0.95    0.03    0.98
system.time(O <- findInterval(e,sort(g))/length(g))
#   user  system elapsed 
#      0       0       0 
identical(o,O)  # may be FALSE
all.equal(o,O)  # should be TRUE
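
# An equivalent formulation (a sketch): stats::ecdf builds the empirical
# CDF of g, and ecdf(g)(p) equals mean(g <= p) by definition:
Fg <- ecdf(g)        # step function built on sort(g)
O2 <- Fg(e)          # proportions; 100*Fg(e) gives percentages
all.equal(O, O2)     # should be TRUE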

# How fast is this on large vectors?
set.seed(21)
e <- rnorm(1e7)
g <- rnorm(1e7)
system.time(O <- findInterval(e,sort(g))/length(g))
#   user  system elapsed 
#  22.08    0.08   22.31
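
# Most of the time above goes into sort(g); the findInterval lookups are
# binary searches (O(n log n) for the sort plus O(m log n) for m queries).
# A sketch for reusing the sort across several query vectors
# (g_sorted is just an illustrative name):
g_sorted <- sort(g)
O <- findInterval(e, g_sorted)/length(g_sorted)
# a second query vector (hypothetical e2) needs no re-sort:
# O2 <- findInterval(e2, g_sorted)/length(g_sorted)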