Question

I'm developing an R package that needs to report a percentile rank for each of the values it returns. However, the reference distribution I have is huge (~10 million values).

The way I'm currently doing it is by generating an ecdf function, saving that function to a file, and loading it in the package when needed. This is problematic because the saved file ends up huge (~120 MB) and takes too long to load back in:

f = ecdf(rnorm(10000000))    # empirical CDF over ~10 million values
save(f, file='tmp.Rsav')     # resulting file is ~120 MB
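
As far as I can tell, the ecdf object is a closure that keeps every unique sorted value in its environment as a knot, so the saved object grows linearly with the input. A quick way to see this on a smaller vector (the names here are just for the demo):

f <- ecdf(rnorm(1e6))               # 1e6 values for a fast demo
length(knots(f))                    # ~1e6 jump points kept in the closure
length(serialize(f, NULL)) / 1e6    # serialized size in MB; scales with the data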

Is there any way to make this more efficient, perhaps by approximating the percentile rank in R?

Thanks

Solution

Just compute an ecdf on a downsampled distribution:

> items <- 100000
> downsample <- 100 # downsample by a factor of 100
> data <- rnorm(items)
> data.down <- sort(data)[(1:(items / downsample)) * downsample] # pick every 100th
> round(ecdf(data.down)(-5:5), 2)
 [1] 0.00 0.00 0.00 0.02 0.16 0.50 0.84 0.98 1.00 1.00 1.00
> round(ecdf(data)(-5:5), 2)
 [1] 0.00 0.00 0.00 0.02 0.16 0.50 0.84 0.98 1.00 1.00 1.00

Note that you'll probably want to think about the downsampling scheme a little, as this example keeps only the top value of each block of 100 sorted values, which returns slightly biased answers, but the general strategy should work.
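
If you want to shrink the saved object even further and reduce that bias, a variation on the same idea is to store a fixed grid of quantiles and rebuild an interpolating function when the package loads. This is just a sketch (f.approx and the file name are made up, and it assumes linear interpolation between stored quantiles is accurate enough for your use):

probs <- seq(0, 1, length.out = 1001)               # 0.1% grid of probabilities
q <- quantile(data, probs = probs, names = FALSE)   # 1001 numbers instead of millions
save(q, probs, file = 'tmp_small.Rsav')             # a few KB rather than ~120 MB

# In the package, rebuild the percentile-rank function by interpolation:
load('tmp_small.Rsav')
f.approx <- approxfun(q, probs, yleft = 0, yright = 1, ties = "ordered")
round(f.approx(-5:5), 2)                            # close to the full ecdf's output

Since quantile() picks values at evenly spaced probabilities rather than every 100th order statistic, the interpolated ranks should track the full ecdf closely across the whole range.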

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow