Question

I'm using org.apache.commons.math3.distribution.NormalDistribution in a large distributed Scala & Akka application. During debugging I found that sample() was occasionally returning NaN, which propagated silently and caused threads to hang in org.apache.commons.math3.ode.nonstiff.DormandPrince853Integrator.

The NaN can be reproduced simply with parallel collections (it doesn't happen in sequential code):

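import org.apache.commons.math3.distribution.NormalDistribution

// One instance shared by every worker thread of the parallel collection
val normal = new NormalDistribution(0, 0.1)
(1 to 1000000000).par.foreach { i =>
    val r = normal.sample()
    // NaN only shows up when sample() is called concurrently
    if (r.isNaN) throw new Exception("r = " + r)
}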

Obviously moving the val normal inside the foreach solves the issue in this case.

I've looked at the docs but can't see anything warning me of such issues. Have I failed to grasp a more fundamental concept about thread safety? Needless to say I'm now checking for NaN.


Solution

Digging through the sources shows that this constructor uses the Well19937c random generator, which at first glance does not look thread-safe on its own.

You can make it thread-safe by explicitly setting the random number generator to a SynchronizedRandomGenerator, which wraps any other random number generator (like Well19937c or Mersenne Twister). Note that by synchronizing access to the generator with SynchronizedRandomGenerator you'll lose any potential performance benefit, and the 'parallel' version will probably be slower than a sequential one because of the synchronization. On the other hand, re-initializing the distribution on every iteration in parallel will probably re-seed the PRNG many times with similar values based on the current time, so your results will be skewed.
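A minimal sketch of the wrapping approach described above. It assumes a Commons Math 3 release that ships SynchronizedRandomGenerator and the NormalDistribution constructor taking a RandomGenerator, and a Scala version where .par is available out of the box (on 2.13 and later it needs the separate scala-parallel-collections module and its CollectionConverters import). The object name is made up for illustration.

```scala
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.random.{SynchronizedRandomGenerator, Well19937c}

object SynchronizedSampling {
  // Wrap the generator so that every call into it is synchronized.
  private val rng = new SynchronizedRandomGenerator(new Well19937c())

  // Arguments: generator, mean, standard deviation, inverse cumulative
  // accuracy (the library's default value is passed explicitly here).
  val normal = new NormalDistribution(
    rng, 0.0, 0.1, NormalDistribution.DEFAULT_INVERSE_ABSOLUTE_ACCURACY)

  def main(args: Array[String]): Unit = {
    // Same check as in the question, on a smaller range.
    (1 to 1000000).par.foreach { i =>
      val r = normal.sample()
      if (r.isNaN) throw new IllegalStateException("r = " + r)
    }
  }
}
```

Every sample now goes through one lock, so this trades the NaN for contention: it is safe, but don't expect it to be faster than the sequential loop.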

A very general rule of thumb (and if I'm wrong here, please correct me) is that 99% of the time, unless explicitly stated otherwise, you should stick to sequential execution when doing anything that relies on random number generation, because PRNGs usually store state that can get corrupted when they are called from multiple threads. And unless you're doing expensive computations afterwards, the synchronization (in the case of thread-safe stateful PRNGs) will be a bottleneck.

OTHER TIPS

A middle ground would be to create normal as a thread local, perhaps by using Twitter's Local implementation.

This would help if the normal.sample method was particularly expensive. You can also be certain that no two parallel operations will be running on the same thread at the same time :)
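As a rough illustration of the thread-local idea, here is a sketch that uses the plain java.lang.ThreadLocal rather than Twitter's Local. The object name, the sample() helper and the seeding scheme (a base seed plus a counter) are assumptions made for this example, not something the library prescribes. Distinct seeds avoid identical streams, but they are not a guarantee that the per-thread streams are statistically independent.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.random.Well19937c

object ThreadLocalSampling {
  // Hypothetical seeding scheme: base seed plus a counter, so that each
  // thread's generator is created with a different seed.
  private val baseSeed = System.nanoTime()
  private val counter = new AtomicLong(0L)

  // Each thread lazily gets its own distribution and its own generator.
  private val localNormal: ThreadLocal[NormalDistribution] =
    new ThreadLocal[NormalDistribution] {
      override def initialValue(): NormalDistribution =
        new NormalDistribution(
          new Well19937c(baseSeed + counter.getAndIncrement()),
          0.0, 0.1, NormalDistribution.DEFAULT_INVERSE_ABSOLUTE_ACCURACY)
    }

  // Safe to call from any thread: no generator state is shared.
  def sample(): Double = localNormal.get().sample()
}
```

Inside the parallel foreach from the question you would then call ThreadLocalSampling.sample() instead of normal.sample. There is no lock on the hot path, and the distribution is built once per thread rather than once per iteration.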

It probably happens because you're using a non-thread-safe object in a multithreaded environment (you are calling the sample method two or more times concurrently). You have to use a thread-safe generator, use one instance of NormalDistribution per thread, or synchronize access to a single instance (probably losing any benefit of par execution).
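The last option, synchronizing access to a single instance, could look roughly like this. It is only a sketch: the object name and the sample() helper are made up for illustration, and the lock is only effective if every caller goes through the helper.

```scala
import org.apache.commons.math3.distribution.NormalDistribution

object LockedSampling {
  // Single shared instance, kept private so it cannot be used without the lock.
  private val normal = new NormalDistribution(0.0, 0.1)

  // Only one thread at a time can be inside the synchronized block.
  def sample(): Double = normal.synchronized { normal.sample() }
}
```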

Try using a thread-safe generator, one instance of NormalDistribution per thread, or synchronized access to a single instance, because I think you are using a non-thread-safe object in a multithreaded environment.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow