cpu cacheline and prefetch policy

Question

Short Answer (TL;DR): you're accessing uninitialized data, your first loop has to allocate new physical pages for the entire array within the timed loop.

When I run your code and comment each of the sections in turn, I get almost the same timing for the two loops. However, I do get the same results you report when I uncomment both sections and run them one after the other. This makes me suspect you also did that, and suffered from cold start effect when comparing the first loop with the second. It's easy to check - just replace the order of the loops and see if the first is still slower.

To avoid, either pick a large enough LENGTH (depending on your system) so that you dont get any cache benefits from the first loop helping the second, or just add a single traversal of the entire array that's not timed.

Note that the second option wouldn't exactly prove what the blog wanted to say - that memory latency masks the execution latency, so it doesn't matter how many elements of a cache line you use, you're still bottlenecked by the memory access time (or more accurately - the bandwidth)

Also - benchmarking code with -O0 is a really bad practice

Edit:

Here's what i'm getting (removed the scheduling as it's not related).
This code:

for (i = 0; i < LENGTH; i++) arr[i] = 1;   // warmup!

clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i++) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("step %d : time %ld\n", 1, jobTime(start, stop));

clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i+=16) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);

Gives :

---------sieofint 4
step 1 : time 58862552
step 16 : time 50215446

While commenting the warmup line gives the same advantage as you reported on the second loop:

---------sieofint 4
step 1 : time 279772411
step 16 : time 50615420

Replacing the order of the loops (warmup is still commented) shows it's indeed not related to the step size but to the ordering:

---------sieofint 4
step 16 : time 250033980
step 1 : time 59168310

(gcc version 4.6.3, on Opteron 6272)

Now a note about what's going on here - in theory, you'd expect warmup to be meaningful only when the array is small enough to sit in some cache - in this case the LENGTH you used is too big even for the L3 on most machines. However, you're forgetting the pagemap - you didn't just skip warming the data itself - you avoided initializing it in the first place. This can never give you meaningful results in real life, but since this a benchmark you didn't notice that, you're just multiplying junk data for the latency of it.

This means that each new page you access on the first loop doesn't only go to memory, it would probably get a page fault and have to call the OS to map a new physical page for it. This is a lengthy process, multiplies by the number of 4K pages you use - accumulating to a very long time. At this array size you can't even benefit from TLBs (you have 16k different physical 4k pages, way more than most TLBs can support even with 2 levels), so it's just the question of the fault flows. This can probably be measures by any profiling tool.

The second iteration on the same array won't have this effect and would be much faster - even though is still has to do a full pagewalk on each new page (that's done purely in HW), and then fetch the data from memory.

By the way, this is also the reason when you benchmark some behavior, you repeat the same thing multiple times (in this case it would have solved your problem if you had run over the array several time with the same stride, and ignored the first few rounds).