What limits scaling in this simple OpenMP program?

Question 1

I finally got a chance to benchmark the code on a completely unloaded system:

[Figure: speedup versus number of threads]

For the dynamic schedule I used schedule(dynamic,1000000). For the static schedule I used the default (evenly between the cores). For thread binding I used export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47".

The main reason for the highly nonlinear scaling of this code is that what AMD calls "cores" aren't actually independent cores. This was part (1) of redrum's answer. It is clearly visible in the plot above as the sudden plateau in speedup at 24 threads, and it is especially obvious with dynamic scheduling. It is also apparent from the thread binding that I chose: what I wrote above turns out to be a terrible choice for binding, because you end up with two threads in each "module".

The second biggest slowdown comes from static scheduling with a large number of threads. Inevitably there is an imbalance between the slowest and fastest threads, which introduces large fluctuations in the run time when the iterations are divided into large chunks, as they are with the default static scheduling. This part of the answer came both from Hristo's comments and Salt's answer.

I don't know why the effects of "Turbo Boost" aren't more pronounced (part 2 of Redrum's answer). Also, I'm not 100% certain where the last bit of the scaling is lost, though presumably it is overhead (we get a 22x speedup instead of the 24x expected from linear scaling in the number of modules). But otherwise the question is pretty well answered.

Question 2

I'm not sure this qualifies as an answer but it feels like more than a comment, so here we go.

I've never noticed particularly linear performance against the number of threads in any of my projects. For one thing, there's the scheduler, which seems to me anything but rigorously fair. OpenMP probably divides the task evenly among its team of threads at the outset, then joins each. On every Linux box I've had the pleasure of using, I would expect a few threads to finish early and a few to lag. Other platforms will vary. However that works out, you're of course waiting for the slowest to catch up. So stochastically speaking, there's a pulse of thread processing going by in something of a bell curve (the more threads, the wider, I should think), and you're never done until the trailing edge crosses the finish line.

What does top say? Does it tell you your process gets 2000% CPU at 20 threads, 4000% at 40? I bet it tapers off. htop by the way typically shows a process total, and separate lines for each thread. That might be interesting to watch.

With a tiny loop like that, you're probably not running into cache thrash or any such annoyance. But another issue that's bound to shave some performance off the top: like any modern multi-core CPU the Opteron runs at a higher clock rate when it's cool. The more cores you heat up, the less turbo mode you'll see.

Question 3

I have two important points as to why your results are not linear. The first one is about Intel hyper-threading and AMD modules. The second is about turbo frequency modes on Intel and AMD.

1.) Hyper-threading and AMD modules/cores

Too many people mistake Intel hyper-threads and the AMD cores in modules for real cores and expect a linear speedup. An Intel processor with hyper-threading can run twice as many hardware threads (hyper-threads) as it has cores. AMD also has its own technology, where the fundamental unit is called a module and each module contains what AMD disingenuously calls cores (see "What's a module, what's a core"). One reason this is easily confused is that, for example, Task Manager in Windows with hyper-threading shows the number of hardware threads but labels them CPUs. This is misleading, as it's not the number of CPU cores.

I don't have enough knowledge of AMD to go into details but as far as I understand each module has one floating point unit (but two integer units). Therefore, you can't really expect a linear speed up beyond the number of Intel cores or AMD modules for floating point operations.

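In your case the Opteron 6348 has 2 dies per processor, each with 3 modules, which each have 2 "cores". Though this gives 12 cores per processor, there are really only 6 floating point units. By the same count, the 48 "cores" in your system share only 24 floating point units.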

I ran your code on my single socket Intel Xeon E5-1620 @ 3.6 GHz. This has 4 cores and hyper-threading (so eight hardware threads). I get:

1 thread:  156s
4 threads: 37s  (156/4 = 39s)
8 threads: 30s  (156/8 = 19.5s)

Notice that for 4 threads the scaling is almost linear, but for 8 threads the hyper-threading only helps a little (at least it helps). Another strange observation is that my single-threaded results are much slower than yours (MSVC2013, 64-bit release mode). I would expect a faster single-threaded Ivy Bridge core to easily trump a slower AMD Piledriver core. This does not make sense to me.

2.) Intel Turbo Boost and AMD Turbo Core.

Intel has a technology called Turbo Boost which changes the clock frequency based on the number of threads that are running. When all threads are running, the turbo boost is at its lowest value. On Linux the only application I know of that can measure this while an operation is running is powertop. The real operating frequency is not easy to measure (for one thing, it needs root access). On Windows you can use CPU-Z. In any case, the result is that you can't expect linear scaling when comparing a run on one thread with a run on the maximum number of real cores.

Once again, I have little experience with AMD processors but as far as I can tell their technology is called Turbo Core and I expect the effect to be similar. This is the reason that a good benchmark disables turbo frequency modes (in the BIOS if you can) when comparing threaded code.