Avoiding Exact Repeats for Mersenne Twister in Python

https://stackoverflow.com/questions/14735468

07-03-2022
|

Question

As many people know, Python uses the Mersenne Twister (MT) algorithm to handle its random numbers. However, despite having a very long period (~2^19937), it is also well-known that you can't reach every random permutation when you shuffle a sequence greater than 2080 elements (since !2081 > 2^19937). As I am dealing with permutations and statistical properties are important to me, I'm trying to figure out the best way to mix or re-seed the Python generator with an additional source of randomness to avoid repetition.

Currently, my concept is to use the system random number generator (SystemRandom) to add an external source of randomness to the MT generator. There are two ways that I can think of to do this:

XOR the SystemRandom random number with the MT random number
Use the SystemRandom to reseed the MT

The first approach is used with some frequency by hardware random number generators, to reduce their bias tendencies. However, it's wildly inefficient. On a Windows XP machine, the SystemRandom is 50 times slower than the standard Python random function. That's a huge performance hit when most of your function involves shuffling. Given that, reseeding the MT with the SystemRandom should be significantly more efficient.

However, there are two issues with that approach also. Firstly, reseeding the MT during operation might disrupt its statistical properties. I'm fairly certain this shouldn't be an issue if the MT runs long enough, as each run of MT values should be well-formed (regardless of the starting point). It does however indicate that a sizable period between MT reseeding is preferred. Secondly, there is the question of what is the most efficient way to trigger reseeding. The simplest way to handle this is with a counter. However, more efficient ways might be possible.

So then, there are three questions on this point:

Has anyone read anything to the effect that reseeding an MT with a random value after every N samples will alter its desirable statistical properties?
Does anyone know a more efficient way to do so than incrementing a counter to trigger a reseed?
Finally, if anyone knows a generally better way to approach this problem, I'm all ears.

Solution

Reseeding will not help you. It will just jump somewhere else into the (very very) long MT sequence. Are-you sure shuffling your data gives you a biased result? Because you will never have enough hours in the universe lifetime to generate all possible sequences. So even if you know that some sequences can possibly never be generated, it doesn't mean that generated sequences will be biased. I guess your best bet is simply to use the shuffle command as it.

If you look at numpy.random.shuffle source code (line 4376), here is basically the algorithm used (I simplified it for clarity):

i = len(x) - 1
while i > 0:
    j = randint(0, i)
    x[i], x[j] = x[j], x[i]
    i = i - 1

In other words, beginning from the end, it swaps the value with a random value taken randomly before it in the array, until all values are swapped. The final state not only depends on the random generator, but also on the initial state of the array. It means that in theory, you should be able to visit all permutations, if you perform enough shuffles.

OTHER TIPS

I realise this was more than a year ago, but in case you glance back, there's an easy solution: Just grab a new value from SystemRandom to XOR with the output from the MT RNG every kth time, for some sufficiently large k, instead of every time. E.g. if SystemRandom is 50 times slower and you set k = 5000, your new combined RNG should only be ~1% slower, and (assuming that SystemRandom is "really" random) any permutation can be reached in every run involving more than 5000 RNG calls.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow