Question

My machine is an Intel Ivy Bridge. My L3 cache is 12 MB, 16-way associative, with a 64 B cache line size.

I have a very large array, long array[12MB/sizeof(long)], in my program. I want to preload the large array before the program executes, to speed up the initialization process.

One way I can think of is to access the whole array from index 0 to the end, in sequence. However, accessing the whole array this way takes too long, and this approach uses only one core.

Another way is to use several threads to access the whole array in parallel, each thread accessing only a part of the array. Since these threads can run on several cores, this could speed up preloading the array into the shared cache. However, this approach needs multiple cores to run the threads.
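
For example, a minimal sketch of this threaded idea using pthreads might look like the following (NTHREADS, touch and sink are just illustrative names, and I assume one read per 64 B cache line is enough to pull the line in):

    #include <pthread.h>
    #include <stddef.h>

    #define ARRAY_LONGS ((12u * 1024 * 1024) / sizeof(long))
    #define NTHREADS    4
    #define LINE_LONGS  (64 / sizeof(long))

    static long array[ARRAY_LONGS];
    static volatile long sink;          /* keeps the reads from being optimized away */

    struct span { size_t begin, end; }; /* element indices [begin, end) */

    static void *touch(void *arg)
    {
        const struct span *s = arg;
        long acc = 0;
        /* one read per 64 B cache line is enough to pull the line in */
        for (size_t i = s->begin; i < s->end; i += LINE_LONGS)
            acc += array[i];
        sink = acc;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct span spans[NTHREADS];
        size_t chunk = ARRAY_LONGS / NTHREADS;

        for (int t = 0; t < NTHREADS; t++) {
            spans[t].begin = t * chunk;
            spans[t].end   = (t == NTHREADS - 1) ? ARRAY_LONGS : (t + 1) * chunk;
            pthread_create(&tid[t], NULL, touch, &spans[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }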

My question is: is there any hardware (like DMA) I can use to issue a command and have the hardware preload a bunch of data into the shared cache?


Solution

It could be possible under some conditions: check whether your CPU supports DCA (Direct Cache Access), and whether you can activate this feature. This might be useful: https://www.myricom.com/software/myri10ge/790-how-do-i-enable-intel-direct-cache-access-dca-with-the-linux-myri10ge-driver.html

I don't think you really need this, though: going over the entire array sequentially should be very efficient, as the CPU will easily recognize it as a sequential stream and trigger the HW prefetcher. Since it's Ivy Bridge, even crossing linear page boundaries should be fast, since the prefetcher can continue across into the next physical page. There may be a small extra gain in accessing several pages in parallel (also in terms of TLB-miss latencies), but eventually it all boils down to one question: can you saturate your memory bandwidth? A single core would probably run into a bottleneck at the core/L3 boundary, so the optimal way is to distribute the work, running a HW thread on each core, each on a different segment (the chunk size could be one 4 KB page per iteration, but larger chunks would also enjoy the benefit of pagemap locality in each core).
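
As a rough sketch of that per-segment walk, assuming one load per 64 B line, consumed one 4 KB page at a time (touch_page and touch_segment are names I'm making up here, and base is assumed to be suitably aligned):

    #include <stddef.h>

    #define PAGE_BYTES 4096u
    #define LINE_BYTES 64u

    static volatile long sink;   /* defeat dead-code elimination */

    /* pull one 4 KB page into the cache hierarchy: one load per 64 B line */
    static void touch_page(const char *page)
    {
        long acc = 0;
        for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES)
            acc += *(const long *)(page + off);
        sink = acc;
    }

    /* walk one thread's segment, one page per iteration */
    void touch_segment(const char *base, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += PAGE_BYTES)
            touch_page(base + off);
    }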

However, you may have a bigger problem than accessing the data, and that's convincing the L3 to keep it there. Ivy Bridge is said to use a dynamic replacement policy in the L3, meaning that it's going to ask itself: who's using all this data? Since you're just preloading it once, the answer would probably be "no one". At that point, the L3 may decide to avoid caching that array altogether, or to write newer blocks over the older ones.

The exact behavior depends on the actual implementation, which hasn't been published, but to "trick" it, I believe you'd have to access each cache line more than once before it gets thrown away. Note that just accessing it twice in a row won't help, since by then it's already in the upper-level caches; you'd have to revisit it at some distance: not so soon that the line is still in L1/L2, but not so late that it has already been evicted from the L3. Some experimentation would of course be required to fine-tune this.
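
A sketch of that double-touch pattern might look like this, where REVISIT_GAP is only a first guess (past the 256 KB L2, well inside the 12 MB L3) that you'd have to tune experimentally:

    #include <stddef.h>

    #define LINE_LONGS (64 / sizeof(long))
    /* revisit distance: 512 KB behind, i.e. past the 256 KB L2 but well
       within the 12 MB L3; this number is a guess and needs tuning */
    #define REVISIT_GAP ((512u * 1024) / sizeof(long))

    static volatile long sink;

    void double_touch(const long *a, size_t n)
    {
        long acc = 0;
        for (size_t i = 0; i < n; i += LINE_LONGS) {
            acc += a[i];                     /* first touch: line enters the caches */
            if (i >= REVISIT_GAP)
                acc += a[i - REVISIT_GAP];   /* second touch: the line should have
                                                left L1/L2 by now but still be in L3 */
        }
        sink = acc;
    }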

EDIT:

Here's a blog post covering the Ivy Bridge L3 replacement policy that you should worry about:
http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/

The actual processing should of course behave nicely, since it would be recognized as making use of the L3's caching benefits; it's only the preload phase that might give you trouble. And if the processing is relatively long, then the initial cold misses may not be worth the effort of preloading at all - beware of premature optimization.

OTHER TIPS

DMA can only preload your array into main memory from, e.g., disk; it does not work with caches. Besides, the time to load 12 MB from RAM into the cache is insignificant compared to what it costs to load it from disk into RAM.

To achieve the latter you can use mmap with MAP_POPULATE. This leaves the mechanism by which your data is prefetched into RAM up to the kernel implementation, but it will in general be faster than doing it manually. The kernel will most likely use DMA or a similar mechanism for this.
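
A minimal sketch, assuming Linux and a hypothetical data file array.bin:

    #define _GNU_SOURCE          /* for MAP_POPULATE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define ARRAY_BYTES (12u * 1024 * 1024)

    int main(void)
    {
        int fd = open("array.bin", O_RDONLY);   /* hypothetical data file */
        if (fd < 0) { perror("open"); return 1; }

        /* MAP_POPULATE asks the kernel to fault the whole mapping in
           up front, instead of lazily, page by page, on first access */
        long *array = mmap(NULL, ARRAY_BYTES, PROT_READ,
                           MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (array == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... use array ... */

        munmap(array, ARRAY_BYTES);
        close(fd);
        return 0;
    }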

Loading stuff into the caches is a far bigger problem, especially since you can't control how cache lines get evicted. The closest you can get is the prefetch instruction (GCC's __builtin_prefetch(const void *addr, ...)), but it doesn't even guarantee a prefetch, and you have to call it on every cache line, which will probably take more time to do than the cache misses themselves.
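
For illustration, a hinting loop over the whole array might look like this (one hint per 64 B line; the CPU is free to drop every one of them):

    #include <stddef.h>

    /* issue one prefetch hint per 64 B cache line; the arguments mean
       rw = 0 (read) and locality = 3 (keep in all cache levels) */
    void prefetch_array(const long *a, size_t n)
    {
        for (size_t i = 0; i < n; i += 64 / sizeof(long))
            __builtin_prefetch(&a[i], 0, 3);
    }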
