Question

Math/programming question that has arisen while I'm trying to use a set of random data as an entropy source, in this case something like Random.org's pregenerated random files. Raw data like this is random zeroes and ones, and can be read off as random bytes (0-255) or as larger power-of-two ranges. I'm trying to be as efficient as possible with this random source, since it's finite in length, so I don't want to consume more data than I need.

Taking random bytes is fair if you want a number from a range whose size divides 256 evenly (e.g. 100 to 355, which spans 256 values, or 0 to 15, which spans 16). However, what if I want a number from 1 to 100? That doesn't fit nicely into 256. I could assign 0-199 to the 1-100 range twice over, leaving 200-255 as extras that would have to be discarded if drawn; otherwise 56 of the numbers in the range would be unfairly weighted to come up more often.
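For what it's worth, the discard-and-retry version of that byte approach is only a few lines. A minimal sketch in C, where next_random_byte() is a hypothetical helper standing in for "read the next byte (0-255) from the random file":

int next_random_byte(void);    /* hypothetical: next byte (0-255) from the random file */

int draw_1_to_100(void)
{
    int b;
    do {
        b = next_random_byte();
    } while (b >= 200);        /* discard 200-255 so no value is favored */
    return (b % 100) + 1;      /* 0-199 covers 1-100 exactly twice */
}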

Is throwing out the out-of-range numbers the only fair option? Or is there a mathematical way to fairly "blur" those 56 extra values over the 1-100 range?

The only other option I've come up with that lets me use every draw without throwing out results is to absorb a larger chunk of data, so that the degree of bias is smaller: with 0-255, some numbers in 1-100 get two "draws" and some get three, 3:2 odds, i.e. 50% more likely; with ten bytes (a pool of 0-2,550) it's 26:25 odds, about 4% more likely; and so on. That uses up more data, but is more predictable.
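As a sanity check on those odds, here is a rough C sketch of the arithmetic; the two pool sizes are just the examples above (256 and 2,551 raw values), and the point is that some outcomes get floor(n/100) raw values while others get one more:

#include <stdio.h>

int main(void)
{
    int pools[] = { 256, 2551 };   /* one byte; the 0-2,550 example */
    for (int i = 0; i < 2; i++) {
        int n  = pools[i];
        int lo = n / 100;                 /* draws for the less-likely outcomes */
        int hi = lo + (n % 100 ? 1 : 0);  /* draws for the more-likely outcomes */
        printf("pool %d: %d:%d odds, %.0f%% more likely\n",
               n, hi, lo, 100.0 * (hi - lo) / lo);
    }
    return 0;
}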

Is there a term for what I'm trying to do (can't Google what I can't name)? Is it possible, or do I have to concede that I'll have to throw out data that doesn't fairly match the range I want?


Solution

If you use 7 bits per number, you get 0-127. Whenever you get a number greater than 99, you have to discard it (add 1 to the rest to land in 1-100). You lose the use of that data point, but what remains is still random. You lose 28 of every 128 values, or about 22% of the random information.
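A minimal sketch of that in C, assuming a hypothetical next_bits(k) helper that returns the next k bits from the random file as an unsigned integer:

unsigned next_bits(int k);   /* hypothetical: next k bits from the random file */

unsigned draw_7bit(void)
{
    unsigned v;
    do {
        v = next_bits(7);    /* 0-127 */
    } while (v > 99);        /* discard 100-127, roughly 22% of draws */
    return v + 1;            /* 1-100 */
}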

If you use 20 bits at a whack, you get a number between 0 and 1,048,575. Provided you first discard anything of 1,000,000 or more, this can be broken into three random values between 0 and 99 (or 1-100 if you add 1 to each). You have to use integer arithmetic, throwing away any fractional part when dividing.

if (number >= 1000000) { /* discard it and draw another 20 bits */ }
a = number % 100;            /* 1's and 10's digits */
b = (number / 100) % 100;    /* 100's and 1,000's digits */
c = (number / 10000) % 100;  /* 10,000's and 100,000's digits */

You only waste 48,576 values out of 1,048,576, or about 5% of the random information.

You can think of the process this way: take the number you get by converting 20 bits to a decimal integer. Break out the 10's and 1's digits, the 1,000's and 100's digits, and the 100,000's and 10,000's digits, and use those as three random numbers. They are truly random, since those digits could take any value at all in the original number. Further, we discarded any values that would bias particular results among the three.

So there's a way to make more efficient use of the random bits. But you have to do some computing.
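Putting the pieces together, a sketch of the whole 20-bit scheme in C; next_bits() is the same hypothetical helper as above (it returns the next k bits from the random file):

unsigned next_bits(int k);   /* hypothetical: next k bits from the random file */

/* Fills out[] with three independent values in 1-100 per accepted 20-bit chunk. */
void draw_three(int out[3])
{
    unsigned number;
    do {
        number = next_bits(20);          /* 0-1,048,575 */
    } while (number >= 1000000);         /* reject ~5% of chunks */
    out[0] = number % 100 + 1;           /* 1's and 10's digits */
    out[1] = (number / 100) % 100 + 1;   /* 100's and 1,000's digits */
    out[2] = (number / 10000) % 100 + 1; /* 10,000's and 100,000's digits */
}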

Note: The next interesting chunk size is 27 bits, which yields four values per chunk and wastes about 25%. 14 bits would yield two values per chunk and waste about 40%.
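Those percentages come from comparing 2^bits against the largest power of 100 that fits inside it; a quick C sketch to reproduce them:

#include <stdio.h>

int main(void)
{
    int sizes[] = { 7, 14, 20, 27 };
    for (int i = 0; i < 4; i++) {
        double total = 1.0, usable = 1.0;
        for (int b = 0; b < sizes[i]; b++) total *= 2.0;     /* 2^bits raw values */
        while (usable * 100.0 <= total)   usable *= 100.0;   /* largest 100^m that fits */
        printf("%2d bits: %.0f%% wasted\n",
               sizes[i], 100.0 * (total - usable) / total);
    }
    return 0;
}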

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow