Question

My machine is an Intel Ivy Bridge. My L3 cache is 12 MB, 16-way associative, with a 64 B cache line size.

I have a very large array, long array[12MB/sizeof(long)], in my program. I want to preload the large array before the program executes, to speed up the initialization process.

One way I can think of is to access the whole array from index 0 to the end, in sequence. However, accessing the whole array this way takes too long, and this approach uses only one core.

Another way is to use several threads to access the whole array in parallel, each thread accessing only a part of the array. Since these threads can run on several cores, this could speed up preloading the array into the shared cache. However, this approach needs multiple cores to run the threads.
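
For example, a minimal sketch of this threaded idea using pthreads might look like the following (NTHREADS, touch and sink are just illustrative names, and I assume one read per 64 B cache line is enough to pull the line in):

    #include <pthread.h>
    #include <stddef.h>

    #define ARRAY_LONGS ((12u * 1024 * 1024) / sizeof(long))
    #define NTHREADS    4
    #define LINE_LONGS  (64 / sizeof(long))

    static long array[ARRAY_LONGS];
    static volatile long sink;          /* keeps the reads from being optimized away */

    struct span { size_t begin, end; }; /* element indices [begin, end) */

    static void *touch(void *arg)
    {
        const struct span *s = arg;
        long acc = 0;
        /* one read per 64 B cache line is enough to pull the line in */
        for (size_t i = s->begin; i < s->end; i += LINE_LONGS)
            acc += array[i];
        sink = acc;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct span spans[NTHREADS];
        size_t chunk = ARRAY_LONGS / NTHREADS;

        for (int t = 0; t < NTHREADS; t++) {
            spans[t].begin = t * chunk;
            spans[t].end   = (t == NTHREADS - 1) ? ARRAY_LONGS : (t + 1) * chunk;
            pthread_create(&tid[t], NULL, touch, &spans[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }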

My question is: is there any hardware (like DMA) I can use to issue a command and have the hardware preload a bunch of data into the shared cache?


Solution

It could be possible under some conditions: check whether your CPU supports DCA (Direct Cache Access), and whether you can activate this feature. This might be useful: https://www.myricom.com/software/myri10ge/790-how-do-i-enable-intel-direct-cache-access-dca-with-the-linux-myri10ge-driver.html

I don't think you really need this, though: going over the entire array sequentially should be very efficient, as the CPU will easily recognize it as a sequential stream and trigger the HW prefetcher. Since it's Ivy Bridge, even crossing linear page boundaries should be fast, since the prefetcher can continue across into the next physical page. There may be a small extra gain in accessing several pages in parallel (also in terms of TLB-miss latencies), but eventually it all boils down to one question: can you saturate your memory bandwidth? A single core would probably run into a bottleneck at the core/L3 boundary, so the optimal way is to distribute the work, running a HW thread on each core, each on a different segment (the chunk size could be one 4 KB page per iteration, but larger chunks would also enjoy the benefit of pagemap locality in each core).
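
As a rough sketch of that per-segment walk, assuming one load per 64 B line, consumed one 4 KB page at a time (touch_page and touch_segment are names I'm making up here, and base is assumed to be suitably aligned):

    #include <stddef.h>

    #define PAGE_BYTES 4096u
    #define LINE_BYTES 64u

    static volatile long sink;   /* defeat dead-code elimination */

    /* pull one 4 KB page into the cache hierarchy: one load per 64 B line */
    static void touch_page(const char *page)
    {
        long acc = 0;
        for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES)
            acc += *(const long *)(page + off);
        sink = acc;
    }

    /* walk one thread's segment, one page per iteration */
    void touch_segment(const char *base, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += PAGE_BYTES)
            touch_page(base + off);
    }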

However, you may have a bigger problem than accessing the data, and that's convincing the L3 to keep it there. Ivy Bridge is said to use a dynamic replacement policy in the L3, meaning that it's going to ask itself: who's using all this data? Since you're just preloading it once, the answer would probably be "no one". At that point, the L3 may decide to avoid caching that array altogether, or to write newer blocks over the older ones.

The exact behavior depends on the actual implementation, which hasn't been published, but to "trick" it, I believe you'd have to access each cache line more than once before it gets thrown away. Note that just accessing it twice in a row won't help, since by then it's already in the upper-level caches; you'd have to revisit it at some distance: not so soon that the line is still in L1/L2, but not so late that it has already been evicted from the L3. Some experimentation would of course be required to fine-tune this.
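
A sketch of that double-touch pattern might look like this, where REVISIT_GAP is only a first guess (past the 256 KB L2, well inside the 12 MB L3) that you'd have to tune experimentally:

    #include <stddef.h>

    #define LINE_LONGS (64 / sizeof(long))
    /* revisit distance: 512 KB behind, i.e. past the 256 KB L2 but well
       within the 12 MB L3; this number is a guess and needs tuning */
    #define REVISIT_GAP ((512u * 1024) / sizeof(long))

    static volatile long sink;

    void double_touch(const long *a, size_t n)
    {
        long acc = 0;
        for (size_t i = 0; i < n; i += LINE_LONGS) {
            acc += a[i];                     /* first touch: line enters the caches */
            if (i >= REVISIT_GAP)
                acc += a[i - REVISIT_GAP];   /* second touch: the line should have
                                                left L1/L2 by now but still be in L3 */
        }
        sink = acc;
    }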

EDIT:

Here's a blog post covering the Ivy Bridge L3 replacement policy that you should worry about:
http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/

The actual processing should of course behave nicely, since it would be recognized as making use of the L3's caching benefits; it's only the preload phase that might give you trouble. And if the processing is relatively long, then the initial cold misses may not be worth the effort of preloading at all - beware of premature optimization.

OTHER TIPS

DMA can only preload your array into main memory from, e.g., disk; it does not work with caches. Besides, the time to load 12 MB from RAM into the cache is insignificant compared to what it costs to load it from disk into RAM.

To achieve the latter you can use mmap with MAP_POPULATE. This leaves the mechanism by which your data is prefetched into RAM up to the kernel implementation, but it will in general be faster than doing it manually. The kernel will most likely use DMA or a similar mechanism for this.
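
A minimal sketch, assuming Linux and a hypothetical data file array.bin:

    #define _GNU_SOURCE          /* for MAP_POPULATE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define ARRAY_BYTES (12u * 1024 * 1024)

    int main(void)
    {
        int fd = open("array.bin", O_RDONLY);   /* hypothetical data file */
        if (fd < 0) { perror("open"); return 1; }

        /* MAP_POPULATE asks the kernel to fault the whole mapping in
           up front, instead of lazily, page by page, on first access */
        long *array = mmap(NULL, ARRAY_BYTES, PROT_READ,
                           MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (array == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... use array ... */

        munmap(array, ARRAY_BYTES);
        close(fd);
        return 0;
    }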

Loading stuff into the caches is a far bigger problem, especially since you can't control how cache lines get evicted. The closest you can get is the prefetch instruction (GCC's __builtin_prefetch(const void *addr, ...)), but it doesn't even guarantee a prefetch, and you have to call it on every cache line, which will probably take more time to do than the cache misses themselves.
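
For illustration, a hinting loop over the whole array might look like this (one hint per 64 B line; the CPU is free to drop every one of them):

    #include <stddef.h>

    /* issue one prefetch hint per 64 B cache line; the arguments mean
       rw = 0 (read) and locality = 3 (keep in all cache levels) */
    void prefetch_array(const long *a, size_t n)
    {
        for (size_t i = 0; i < n; i += 64 / sizeof(long))
            __builtin_prefetch(&a[i], 0, 3);
    }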
