Question

I am trying to understand the performance implications of using CLFLUSH. For this, I wrote a small pointer chasing benchmark: I take a std::vector<std::pair<size_t, size_t>> where the first element of each pair is the offset of the next entry and the second element is a payload. Starting at entry 0, I follow the offsets from one entry to the next until I am back at the beginning, summing up all the payloads along the way.

Also, I have two parameters: if write==1, I modify the payload after reading it (thus dirtying the cache line), and if clflush==1, I perform a CLFLUSH before going to the next element.

The size of the vector is equal to the size of the L1 cache (32 KiB).
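Roughly, the inner loop looks like this (a simplified sketch of what I described above, not the exact code from the linked benchmark):

#include <emmintrin.h>   // _mm_clflush
#include <utility>
#include <vector>

// Simplified sketch of the chase loop: follow the offsets, sum the payloads,
// optionally dirty and/or flush each cache line before moving on.
size_t chase(std::vector<std::pair<size_t, size_t>>& v, bool write, bool clflush) {
    size_t sum = 0;
    size_t pos = 0;
    do {
        auto& entry = v[pos];
        size_t next = entry.first;      // offset of the next entry
        sum += entry.second;            // accumulate the payload
        if (write)
            entry.second = sum;         // modify the payload (dirty the line)
        if (clflush)
            _mm_clflush(&entry);        // flush before going to the next element
        pos = next;
    } while (pos != 0);                 // stop once we are back at the start
    return sum;
}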

Here are my results:

write   clflush runtime
0       0       5324060
0       1       298751237
1       0       4366570
1       1       180303091

I do understand why the runs with clflush are slower than those without. But why are reads+writes faster than plain reads, and why does it appear to be faster to CLFLUSH a dirty cache line than a clean one?

For reference, you can find my benchmark here, I compiled it using g++-4.8 -std=c++11 -lrt -O3.


Solution

This may not feel like an answer, but I don't think the effects you are seeing are real. Here's what I see when I run your test program on a Haswell i7-4770 with a few different compilers:

nate@haswell:~/stack$ chase-g481-orig
write   clflush runtime
0   0   3238411
0   1   55916728
1   0   3220700
1   1   88299263
nate@haswell:~/stack$ chase-icpc-orig
write   clflush runtime
0   0   3226673
0   1   53840185
1   0   4858013
1   1   88143220
nate@haswell:~/stack$ chase-clang-orig
write   clflush runtime
0   0   13521595
0   1   54542441
1   0   3394006
1   1   88344640

Lots of differences between them, but nothing matching what you saw. I also ran on a Sandy Bridge E5-1620 and found similar results to these (not matching yours), although the older version of clang++ on that machine didn't blow up on the no-write no-flush case.

First, it's a little awkward that your program uses the entire L1 cache. If you had complete control of the system (a CPU reserved at boot), this might be reasonable, but as it is, it seems likely to introduce confounding effects. If your goal is to understand this effect rather than to see how the cache behaves at full capacity, I'd suggest reducing the total size to half the cache size or less.

I think the most likely explanation is that the different compilers are hoisting the clflush to different places in the function, and some of them aren't doing what you intend. It can be very difficult to convince a compiler to do exactly what you want when you are working at this level: since the clflush intrinsic doesn't actually alter the result, the optimizer's rules often destroy your intent.
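One thing you can try is pinning the flush in place with a compiler barrier so the optimizer can't move memory accesses across it. A sketch, assuming GCC/Clang-style inline asm (the helper name is mine, and this is only a compiler barrier, not a CPU fence):

#include <emmintrin.h>   // _mm_clflush

// Flush a line without letting the compiler move memory accesses
// across the flush. Purely a compiler barrier, not a fence.
static inline void clflush_nomove(const void* p) {
    asm volatile("" ::: "memory");   // compiler barrier
    _mm_clflush(p);
    asm volatile("" ::: "memory");   // compiler barrier
}

That won't stop the CPU from reordering anything, but it at least keeps the emitted instructions in the order you wrote them.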

I tried looking at the generated assembly (objdump -d -C chase) and had trouble getting my bearings. Everything is inlined directly into main, so it wasn't as simple as looking at the chase() function to see what was happening. Compiling with -g (for debugging) and adding -S (to interleave source) to the objdump command helped, but it was still complex. My attempts to stop the compilers from inlining failed.

If it were me, I'd switch to C and compile with -fno-inline-functions and check to see if you still get the same effect. Then dissect the chase() function until you understand what's happening. Then use gcc -S to output the assembly, modify it until it's in the correct order, and see if the effect is still there.
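If you'd rather stay in C++, another thing to try (no guarantee it will work any better than my attempts did) is marking the function itself non-inlinable. A sketch, using the GCC/Clang-specific attribute:

#include <cstddef>
#include <utility>
#include <vector>

// Ask g++/clang++ to keep chase() out of main() so it shows up as a
// separate, easy-to-read symbol in the objdump output.
__attribute__((noinline))
size_t chase(std::vector<std::pair<size_t, size_t>>& v, bool write, bool clflush);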

It's also worth noting that according to the Intel Architecture Reference Manual, clflush is not a serializing instruction. Even if the assembly is in the order you think it should be, the processor is free to execute instructions that come after it before it, and vice versa. Given the way you are chasing, I don't think the window is wide enough for this to be a factor, but who knows. You can enforce ordering by adding an mfence.
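A minimal sketch of what that could look like, using the intrinsics from <emmintrin.h> (the helper name is mine; MFENCE is documented to order CLFLUSH with respect to it):

#include <emmintrin.h>   // _mm_clflush, _mm_mfence

// Fence on both sides so the CPU cannot reorder the flush with the
// surrounding loads and stores.
static inline void clflush_fenced(const void* p) {
    _mm_mfence();        // complete earlier memory operations first
    _mm_clflush(p);
    _mm_mfence();        // order the flush before later memory operations
}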

Another possibility is that clflush behaves oddly on your particular processor. You could switch to the nuclear option of using wbinvd to invalidate all of the caches. It's a difficult instruction to use, as it is privileged and needs to be executed by the kernel, so you'd have to write an ioctl to do it.

Good luck!

Licensed under: CC-BY-SA with attribution