Question

I need to select 3.7*10^8 unique values from the range [0, 3*10^9] and either obtain them in order or keep them in memory.

To do this, I started working on a simple algorithm where I sample smaller uniform distributions (that fit in memory) in order to indirectly sample the large distribution that really interests me.

The code is available at the following gist https://gist.github.com/legaultmarc/7290ac4bef4edb591d1e

Since I'm having trouble implementing something more robust, I was wondering if you had other ideas to sample unique values from a large discrete uniform. I'm looking for either an algorithm, a module or an idea on how to manage very large lists directly (perhaps using the hard drive instead of memory).

Was it helpful?

Solution

There is an interesting post, Generating sorted random ints without the sort? O(n) which suggests that instead of generating uniform random ints, you can do a running-sum on exponential random deltas, which gives you a uniform random result generated in sorted order.

It's not guaranteed to give exactly the number of samples you want, but should be pretty close, and much faster / lower memory requirements.

Edit: I found a second post, generating sorted random numbers without exponentiation involved? which suggests tweaking the distribution density as you go to generate an exact number of samples, but I am leery of just exactly what this would do to your "uniform" distribution.

Edit2: Another possibility that occurs to me would be to use an inverse cumulative binomial distribution to iteratively split your sample range (predict how many uniformly generated random samples would fall in the lower half of the range, then the remainder must be in the upper half) until the block-size reaches something you can easily hold in memory.

OTHER TIPS

This is a standard sample with out replacement. You can't divide the range [0, 3*10^9] into equally binned ranges and sample same amount in each bin. Also, 3 billion is relative large, many "ready to use" codes only handle 32 bit integers, roughly 2 billion(+-). Please take a close look at their implementations.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top