I finally got a chance to benchmark the code with a completely unloaded system:
For the dynamic schedule I used schedule(dynamic,1000000)
. For the static schedule I used the default (evenly between the cores). For thread binding I used export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47"
.
The main reason for the highly nonlinear scaling for this code is because what AMD calls "cores" aren't actually independent cores. This was part (1) of redrum's answer. This is clearly visible in the plot above from the sudden plateau of speedup at 24 threads; it's really obvious with the dynamic scheduling. It's also obvious from the thread binding that I chose: it turns out what I wrote above would be a terrible choice for binding, because you end up with two threads in each "module".
The second biggest slowdown comes from static scheduling with a large number number of threads. Inevitably there is an unbalance between the slowest and fastest threads, introducing large fluctuations in the run time when the iterations are divided in large chunks with the default static scheduling. This part of the answer came both from Hristo's comments and Salt's answer.
I don't know why the effects of "Turbo Boost" aren't more pronounced (part 2 of Redrum's answer). Also, I'm not 100% certain where (presumably in overhead) the last bit of the scaling comes is lost (we get 22x performance instead of expected 24x from linear scaling in number of modules). But otherwise the question is pretty well answered.