The main problem, as others have pointed out, is that the 128-byte blocks you are checking miss the data cache and/or the TLB, so the loads go all the way to DRAM, which is slow. VTune is telling you this:
cmp qword ptr [rax], 0x0 0.171s
jnz 0x14000222 42.426s
You have another, smaller hotspot halfway down:
cmp qword ptr [rax+0x40], 0x0 0.156s
jnz 0x14000222 2.550s
Those 42.4 + 2.5 seconds attributed to the JNZ instructions are really stalls caused by the preceding load from memory. Over the time you profiled the program, the processor sat idle for about 45 seconds in total, waiting on DRAM.
You might ask what the second hotspot halfway down is all about. You are accessing 128 bytes and cache lines are 64 bytes. The CPU started prefetching the second line for you as soon as it read the first 64 bytes, but you didn't do enough work with the first 64 bytes to fully overlap the latency of going to memory.
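To make the cache-line arithmetic concrete, here is a minimal sketch that counts how many 64-byte lines a block touches. The name `cache_lines_touched` is hypothetical, and the 64-byte line size is the figure quoted above for this CPU.

```cpp
#include <cstddef>
#include <cstdint>

// Cache line size assumed from the discussion above (64 bytes).
constexpr std::size_t kCacheLineBytes = 64;

// Number of distinct cache lines covered by `size` bytes starting at `address`.
constexpr std::size_t cache_lines_touched(std::uintptr_t address, std::size_t size) {
    return size == 0
        ? 0
        : (address + size - 1) / kCacheLineBytes - address / kCacheLineBytes + 1;
}
```

A 64-byte-aligned 128-byte block covers exactly two lines, which matches the two hotspots at `[rax]` and `[rax+0x40]`. A block that is not 64-byte aligned would cover three lines, so it is worth checking the alignment of your blocks.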
The memory bandwidth of Ivy Bridge is very high (it depends on your system, but I'm guessing over 10 GB/sec). Your block of memory is 4GB, so at that rate a full pass takes roughly 0.4 seconds. You should be able to zip through it in less than 1 second if you access it sequentially and let the CPU prefetch data ahead for you.
My guess is you are thwarting the CPU data prefetcher by accessing the 128-byte blocks in a non-contiguous fashion.
Change your access pattern to be sequential and you'll be surprised how much faster it runs. You can then worry about the next level of optimization, which will be making sure the branch prediction works well.
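One way to get a sequential pattern, assuming the order in which you visit blocks does not affect your result, is to sort the block indices before walking the data. The sketch below is only an illustration: `block_is_zero` and `count_zero_blocks_sorted` are hypothetical names, and the "is this block all zero" check is a guess at the kind of work your loop does, based on the `cmp qword ptr [...], 0x0` instructions in the profile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 128-byte block = 16 qwords (assumed layout).
constexpr std::size_t kQwordsPerBlock = 16;

// Hypothetical per-block work: true if every qword in the block is zero.
bool block_is_zero(const std::uint64_t* block) {
    for (std::size_t k = 0; k < kQwordsPerBlock; ++k) {
        if (block[k] != 0) return false;
    }
    return true;
}

// Visits the requested blocks in ascending address order so the hardware
// prefetcher sees a monotonic stream instead of random jumps.
// `block_indices` is taken by value because it is sorted locally.
std::size_t count_zero_blocks_sorted(const std::uint64_t* data,
                                     std::size_t num_blocks,
                                     std::vector<std::size_t> block_indices) {
    std::sort(block_indices.begin(), block_indices.end());
    std::size_t zero_blocks = 0;
    for (std::size_t idx : block_indices) {
        assert(idx < num_blocks);
        zero_blocks += block_is_zero(data + idx * kQwordsPerBlock) ? 1 : 0;
    }
    return zero_blocks;
}
```

If the visiting order does matter, sorting is not an option and explicit prefetching is the more likely fix.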
Another thing to consider is TLB misses. Those are costly, especially in a 64-bit system. Rather than using 4KB pages, consider using 2MB huge pages. See this link for Windows support for these: Large-Page Support (Windows)
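As a rough sketch of what a large-page allocation can look like on Windows: `GetLargePageMinimum`, `VirtualAlloc` with `MEM_LARGE_PAGES`, and `VirtualFree` are the real Win32 calls, while `round_up`, `Buffer`, `alloc_buffer`, and `free_buffer` are hypothetical helpers. The sketch assumes the process already holds and has enabled the "Lock pages in memory" privilege (`SeLockMemoryPrivilege`); without it the large-page request fails and the code falls back to normal pages. On other platforms it simply uses `malloc` so that it still compiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#include <windows.h>
#endif

// Round `size` up to the next multiple of `granule` (granule must be nonzero).
std::size_t round_up(std::size_t size, std::size_t granule) {
    assert(granule != 0);
    return (size + granule - 1) / granule * granule;
}

struct Buffer {
    void*       ptr;          // nullptr on failure
    std::size_t bytes;        // size actually requested from the OS
    bool        large_pages;  // true only if large pages were obtained
};

Buffer alloc_buffer(std::size_t bytes) {
#ifdef _WIN32
    // Returns 0 if the system does not support large pages.
    const SIZE_T large = GetLargePageMinimum();
    if (large != 0) {
        // The size must be a multiple of the large-page minimum.
        const std::size_t rounded = round_up(bytes, large);
        void* p = VirtualAlloc(nullptr, rounded,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (p) return {p, rounded, true};
    }
    // Fallback: ordinary pages (for example, when the privilege is not held).
    void* p = VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    return {p, bytes, false};
#else
    // Non-Windows placeholder so the sketch compiles elsewhere.
    return {std::malloc(bytes), bytes, false};
#endif
}

void free_buffer(Buffer& b) {
#ifdef _WIN32
    if (b.ptr) VirtualFree(b.ptr, 0, MEM_RELEASE);
#else
    std::free(b.ptr);
#endif
    b = {nullptr, 0, false};
}
```

Check the `large_pages` flag after allocating. If it is false, you are still on 4KB pages and the TLB pressure is unchanged.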
If you must access the 4GB data in a somewhat random fashion, but you know the sequence of m7 values (your index) ahead of time, then you can pipeline the memory fetching explicitly ahead of your use (it needs to be several hundred CPU cycles ahead of when you will be using it to be effective).
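A minimal sketch of that kind of explicit pipelining, assuming the index sequence is available up front as a vector of block numbers. `prefetch`, `count_zero_blocks_prefetched`, and `kPrefetchDistance` are hypothetical names; the per-block zero check is a stand-in for your real work; and the distance of 16 blocks is only a starting guess that you would need to tune by measurement. `_mm_prefetch` and `__builtin_prefetch` are the real compiler intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <xmmintrin.h>
inline void prefetch(const void* p) {
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
}
#elif defined(__GNUC__) || defined(__clang__)
inline void prefetch(const void* p) { __builtin_prefetch(p, 0, 3); }
#else
inline void prefetch(const void*) {}  // no-op on unknown compilers
#endif

constexpr std::size_t kQwordsPerBlock   = 16;  // 128-byte block (assumed layout)
constexpr std::size_t kPrefetchDistance = 16;  // blocks ahead; a guess, tune it

// Walks the blocks in the given (possibly random) order, requesting each
// block's two cache lines kPrefetchDistance iterations before it is used.
std::size_t count_zero_blocks_prefetched(const std::uint64_t* data,
                                         std::size_t num_blocks,
                                         const std::vector<std::size_t>& block_indices) {
    const std::size_t n = block_indices.size();
    std::size_t zero_blocks = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchDistance < n) {
            const std::size_t future_idx = block_indices[i + kPrefetchDistance];
            assert(future_idx < num_blocks);
            const std::uint64_t* future = data + future_idx * kQwordsPerBlock;
            prefetch(future);      // first 64-byte line
            prefetch(future + 8);  // second line (assumes 64-byte-aligned blocks)
        }
        assert(block_indices[i] < num_blocks);
        const std::uint64_t* block = data + block_indices[i] * kQwordsPerBlock;
        std::uint64_t acc = 0;
        for (std::size_t k = 0; k < kQwordsPerBlock; ++k) acc |= block[k];
        zero_blocks += (acc == 0) ? 1 : 0;
    }
    return zero_blocks;
}
```

The right distance depends on how much work each block takes: too short and the data has not arrived yet, too long and it may be evicted before you use it.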
Here are some links that might be helpful in general on the subject of memory optimizations:
What Every Programmer Should Know About Memory by Ulrich Drepper
http://www.akkadia.org/drepper/cpumemory.pdf
Machine Architecture: Things Your Programming Language Never Told You, by Herb Sutter
http://www.gotw.ca/publications/concurrency-ddj.htm
http://nwcpp.org/static/talks/2007/Machine_Architecture_-_NWCPP.pdf
http://video.google.com/videoplay?docid=-4714369049736584770#