Question

In my current C++11 project I need to perform M simulations. For each simulation m = 1, ..., M, I randomly generate a data set by using a std::mt19937 object, constructed as follows:

std::mt19937 generator(m);
DatasetFactory dsf(generator);

According to https://stackoverflow.com/a/15509942/1849221 and https://stackoverflow.com/a/14924350/1849221, the Mersenne Twister PRNG benefits from a warm up phase, which is currently absent in my code. I report for convenience the proposed snippet of code:

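#include <algorithm>   // std::generate_n
#include <cstdint>     // std::uint_least32_t
#include <functional>  // std::ref
#include <iterator>    // std::begin, std::end
#include <random>

std::mt19937 get_prng() {
    std::uint_least32_t seed_data[std::mt19937::state_size];
    std::random_device r;
    // Fill the whole state with non-deterministic values.
    std::generate_n(seed_data, std::mt19937::state_size, std::ref(r));
    std::seed_seq q(std::begin(seed_data), std::end(seed_data));
    return std::mt19937{q};
}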

The problem in my case is that I need reproducible results: across different executions, the data set for each simulation has to be the same. That is why my current solution seeds the Mersenne Twister PRNG with the index m of the current simulation. It seems to me that using std::random_device prevents the data from being the same (AFAIK, this is the exact purpose of std::random_device).

EDIT: by different executions I mean re-launching the executable.

How can I introduce the aforementioned warm-up phase in my code without affecting reproducibility? Thanks.

Possible solution #1

Here's a tentative implementation based on the second proposal by @SteveJessop

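#include <algorithm>   // std::generate_n
#include <cstdint>     // std::uint_least32_t
#include <functional>  // std::ref
#include <iterator>    // std::begin, std::end
#include <random>

std::mt19937 get_generator(unsigned int seed) {
    // A deterministic LCG replaces std::random_device as the source of seed data.
    std::minstd_rand0 lc_generator(seed);
    std::uint_least32_t seed_data[std::mt19937::state_size];

    std::generate_n(seed_data, std::mt19937::state_size, std::ref(lc_generator));
    std::seed_seq q(std::begin(seed_data), std::end(seed_data));
    return std::mt19937{q};
}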

Possible solution #2

Here's a tentative implementation based on the joint contribution by @SteveJessop and @AndréNeve. The sha256 function is adapted from https://stackoverflow.com/a/10632725/1849221

#include <openssl/sha.h>
#include <sstream>
#include <iomanip>
#include <random>
#include <string>

// Returns the SHA-256 digest of str as a lowercase hex string.
std::string sha256(const std::string& str) {
    unsigned char hash[SHA256_DIGEST_LENGTH];
    SHA256_CTX sha256;
    SHA256_Init(&sha256);
    SHA256_Update(&sha256, str.c_str(), str.size());
    SHA256_Final(hash, &sha256);

    std::stringstream ss;
    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
        ss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(hash[i]);

    return ss.str();
}

std::mt19937 get_generator(unsigned int seed) {
    std::string seed_str = sha256(std::to_string(seed));
    std::seed_seq q(seed_str.begin(), seed_str.end());
    return std::mt19937{q};
}

Compile with: -I/opt/ssl/include/ -L/opt/ssl/lib/ -lcrypto


Solution

Two options:

  1. Follow the proposal you have, but instead of using std::random_device r; to generate your seed sequence for MT, use a different PRNG seeded with m. Choose one that doesn't suffer like MT does from needing a warmup when used with small seed data: I suspect an LCG will probably do. For massive overkill, you could even use a PRNG based on a secure hash. This is a lot like "key stretching" in cryptography, if you've heard of that. You could in fact use a standard key stretching algorithm, but you're using it to generate a long seed sequence rather than large key material.

  2. Continue using m to seed your MT, but discard a large constant amount of data before starting the simulation. That is to say, ignore the advice to use a strong seed and instead run the MT long enough for it to reach a decent internal state. I don't know off-hand how much data you need to discard, but I expect the internet does.
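A minimal sketch of option 2, for illustration only. The helper name get_warmed_generator and the default of 700000 discarded outputs are assumptions of this sketch, not values given in the answer; pick the amount after checking what is recommended for std::mt19937.

```cpp
#include <random>

// Hypothetical helper: seed with the simulation index m, then throw away a
// fixed number of outputs. The default of 700000 is an assumed placeholder.
std::mt19937 get_warmed_generator(unsigned int m,
                                  unsigned long long warmup = 700000ULL) {
    std::mt19937 generator(m);   // the same m always gives the same initial state
    generator.discard(warmup);   // deterministic, so reproducibility is preserved
    return generator;
}
```

Since both the seed and the discarded amount are fixed, re-launching the executable should give the same data set for each m.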

OTHER TIPS

I think that you only need to store two things for each run/simulation you wish to reproduce: the initial seed (in your case the std::uint_least32_t seed_data[std::mt19937::state_size] array) and the number n of warm-up steps you made (e.g. using discard(n) as mentioned).

With this information, you can always create a new MT instance, seed it with the previous seed_data and run it for the same n warmup steps. This will generate the same sequence of values onwards since the MT instance will have the same inner state when the warmup ends.
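As a rough sketch of that idea (RunRecord and restore_generator are hypothetical names introduced here), rebuilding the generator from the stored pair could look like this:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical record of what has to be kept for each simulation.
struct RunRecord {
    std::vector<std::uint_least32_t> seed_data;  // words fed to std::seed_seq
    unsigned long long warmup_steps;             // the n passed to discard()
};

// Rebuilds a generator in the state it had when the warm-up ended.
std::mt19937 restore_generator(const RunRecord& record) {
    std::seed_seq q(record.seed_data.begin(), record.seed_data.end());
    std::mt19937 generator(q);
    generator.discard(record.warmup_steps);
    return generator;
}
```

Two generators restored from the same record should produce identical sequences from that point on.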

When you mention the std::random_device affecting reproducibility, I believe that in your code it is simply being used to generate the seed data. If you were using it as the source of random numbers itself, then you would not be able to have reproducible results. Since you are using it only to generate the seed there shouldn't be any problem. You just can't generate a new seed every time if you want to reproduce values!

From the definition of std::random_device:

"std::random_device is a uniformly-distributed integer random number generator that produces non-deterministic random numbers."

So if it is not deterministic, you cannot reproduce the sequence of values it produces. That said, you can use it once to generate good random seeds and then store them for the re-runs.

Hope this helps

EDIT:

After discussing with @SteveJessop, we arrived at the conclusion that a simple hash of the dataset (or part of it) would be sufficient to be used as a decent seed for the purpose you need. This allows for a deterministic way of generating the same seeds every time you run your simulations. As mentioned by @Steve, you will have to guarantee that the size of the hash isn't too small compared with std::mt19937::state_size. If it is too small, then you can concatenate the hashes of m, m+M, m+2M, ... until you have enough data, as he suggested.
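To make the concatenation concrete, here is a sketch that hashes m, m+M, m+2M, ... until std::mt19937::state_size words are available. It uses 64-bit FNV-1a purely as a dependency-free stand-in; that choice, and the names fnv1a_64 and get_hashed_generator, are assumptions of this sketch, and a stronger hash such as SHA-256 could be substituted.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// FNV-1a, 64-bit, over the 8 bytes of value. Stand-in hash for illustration.
std::uint64_t fnv1a_64(std::uint64_t value) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (int i = 0; i < 8; ++i) {
        hash ^= (value >> (8 * i)) & 0xFFu;        // next byte, low to high
        hash *= 1099511628211ULL;                  // FNV prime
    }
    return hash;
}

// m is the simulation index, M the total number of simulations.
std::mt19937 get_hashed_generator(std::uint64_t m, std::uint64_t M) {
    std::vector<std::uint_least32_t> seed_data;
    seed_data.reserve(std::mt19937::state_size);
    // Each hash yields two 32-bit words; keep hashing m + k*M until full.
    for (std::uint64_t k = 0; seed_data.size() < std::mt19937::state_size; ++k) {
        const std::uint64_t h = fnv1a_64(m + k * M);
        seed_data.push_back(static_cast<std::uint_least32_t>(h & 0xFFFFFFFFu));
        seed_data.push_back(static_cast<std::uint_least32_t>(h >> 32));
    }
    std::seed_seq q(seed_data.begin(), seed_data.end());
    return std::mt19937(q);
}
```

Stepping by M keeps the hashed inputs of different simulations from overlapping, as long as m stays in the range 1, ..., M.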

I am posting the updated answer here as the idea of using a hash was mine, but I will upvote @SteveJessop's answer because he contributed to it.

A comment on one of the answers you link to indicates:

Coincidentally, the default C++11 seed_seq is the Mersenne Twister warmup sequence (although the existing implementations, libc++'s mt19937 for example, use a simpler warmup when a single-value seed is provided)

So you may be able to use your current fixed seeds with std::seed_seq to do the warm-up for you.

#include <random>

std::mt19937 get_prng(int seed) {
    // Extra fixed values may be appended after seed, e.g. {seed, a, b, c}.
    std::seed_seq q{seed};
    return std::mt19937{q};
}
Licensed under: CC-BY-SA with attribution