Question

I have a question concerning runtime measurements in parallel programs (I used C++ but I think the question is more general).

Some short explanations: three threads run in parallel (pthreads), solving the same problem in different ways. Each thread may pass information to the others (e.g. partial solutions it has obtained but they have not, yet) to speed them up, depending on its own status / the information available in its own calculation. The whole process stops as soon as the first thread is ready. Now I would like a single time measurement for evaluating the runtime from start until the problem is solved. (In the end, I want to determine whether exploiting synergy effects through a parallel calculation is faster than calculating on a single thread.)

In my eyes, the problem is that (because the operating system pauses / unpauses the individual threads) the point in each thread's state at which information is passed is not deterministic. That is, a certain piece of information is acquired after xxx units of CPU time on thread 1, but it cannot be controlled whether thread 2 receives this information after yyy or zzz units of CPU time spent in its own calculation. Assuming this information would have completed thread 2's calculation in either case, the runtime of thread 2 would be either yyy or zzz, depending on the operating system's actions.

What can I do to obtain deterministic behaviour for runtime comparisons? Can I order the operating system to run each thread "undisturbed" (on a multicore machine)? Is there something I can do at the implementation (C++) level?

Or are there other concepts for evaluating runtime (time gain) of such implementations?

Best regards Martin

Solution

Any time someone uses the terms 'deterministic' and 'multicore' in the same sentence, it sets alarm bells ringing :-)

There are two big sources of non-determinism in your program: 1) the operating system, which adds noise to thread timings through OS jitter and scheduling decisions; and 2) the algorithm, because the program follows a different path depending on the order in which communication (of the partial solutions) occurs.

As a programmer, there's not much you can do about OS noise. A standard OS adds a lot of noise even for a program running on a dedicated (quiescent) node. Special purpose operating systems for compute nodes go some way to reducing this noise, for example Blue Gene systems exhibit significantly less OS noise and therefore less variation in timings.

Regarding the algorithm, you can introduce determinism to your program by adding synchronisation. If two threads synchronise, for example to exchange partial solutions, then the ordering of the computation before and after the synchronisation is deterministic. Your current code is asynchronous, as one thread 'sends' a partial solution but does not wait for it to be 'received'. You could convert this to a deterministic code by dividing the computation into steps and synchronising between threads after each step. For example, for each thread:

  1. Compute one step
  2. Record partial solution (if any)
  3. Barrier - wait for all other threads
  4. Read partial solutions from other threads
  5. Repeat 1-4

Of course, we would not expect this code to perform as well, because now each thread has to wait for all the other threads to complete their computation before proceeding to the next step.

The best approach is probably to just accept the non-determinism, and use statistical methods to compare your timings. Run the program many times for a given number of threads and record the range, mean and standard deviation of the timings. It may be enough for you to know e.g. the maximum computation time across all runs for a given number of threads, or you may need a statistical test such as Student's t-test to answer more complicated questions like 'how certain is it that increasing from 4 to 8 threads reduces the runtime?'. As DanielKO says, the fluctuations in timings are what will actually be experienced by a user, so it makes sense to measure these and quantify them statistically, rather than aiming to eliminate them altogether.

OTHER TIPS

What's the use of such a measurement?

Suppose you could, by some contrived method, set up the OS scheduler so that the threads run undisturbed (even by indirect effects such as other processes touching the caches, the MMU, etc.). Would that be realistic for the actual usage of the parallel program?

It's pretty rare for a modern OS to let an application take control over general interrupt handling, memory management, thread scheduling, etc. Unless you are talking directly to the metal, your deterministic measurements will not only be impractical, but the users of your program will never experience them (unless they run equally close to the metal as you did when measuring).

So my question is, why do you need such strict conditions for measuring your program? In the general case, just accept the fluctuations, as that is what the users will most likely see. If the speed up of a certain algorithm/implementation is so insignificant as to be indistinguishable from the background noise, that's more useful information to me than knowing the actual speedup fraction.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow