Question

I want to learn how people do cache optimization and I was suggested cachegrind by a friend as a useful tool towards this goal.

Valgrind being a CPU simulator, assumes a 2-level cache, as mentioned here, when using cachegrind

Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.

The next paragraph continues as

However, some modern machines have three or four levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and last-level caches. The reason for this choice is that the last-level cache has the most influence on runtime, as it masks accesses to main memory.

However when I tried running the valgrind on my simple matrix-matrix multiplication code, I got the following output.

==6556== Cachegrind, a cache and branch-prediction profiler
==6556== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==6556== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==6556== Command: ./a.out
==6556== 
--6556-- warning: L3 cache detected but ignored
==6556== 
==6556== I   refs:      50,986,869
==6556== I1  misses:         1,146
==6556== L2i misses:         1,137
==6556== I1  miss rate:       0.00%
==6556== L2i miss rate:       0.00%
==6556== 
==6556== D   refs:      20,232,408  (18,893,241 rd   + 1,339,167 wr)
==6556== D1  misses:       150,194  (   144,869 rd   +     5,325 wr)
==6556== L2d misses:        10,451  (     5,506 rd   +     4,945 wr)
==6556== D1  miss rate:        0.7% (       0.7%     +       0.3%  )
==6556== L2d miss rate:        0.0% (       0.0%     +       0.3%  )
==6556== 
==6556== L2 refs:          151,340  (   146,015 rd   +     5,325 wr)
==6556== L2 misses:         11,588  (     6,643 rd   +     4,945 wr)
==6556== L2 miss rate:         0.0% (       0.0%     +       0.3%  )

According to the documentation, the L1 and the L3 caches should have been used but the output says that L3 cache is being ignored. Why is that?

Also does cachegrind preassume what the L1 and last-level cache sizes are, or does it use the L1 and last-level cache sizes of the CPU it is currently running on?

Was it helpful?

Solution

You're running on an intel CPU that cachegrind appears to not have full support for. They inspect the cpuid flags and determine support based on a huge set of case statements for different processors.

This is from a unofficial copy of the code, but is illustrative - https://github.com/koriakin/valgrind/blob/master/cachegrind/cg-x86-amd64.c:

/* Intel method is truly wretched.  We have to do an insane indexing into an
 * array of pre-defined configurations for various parts of the memory
 * hierarchy.
 * According to Intel Processor Identification, App Note 485.
 */
static
Int Intel_cache_info(Int level, cache_t* I1c, cache_t* D1c, cache_t* L2c)
{
...
      case 0x22: case 0x23: case 0x25: case 0x29:
      case 0x46: case 0x47: case 0x4a: case 0x4b: case 0x4c: case 0x4d:
      case 0xe2: case 0xe3: case 0xe4: case 0xea: case 0xeb: case 0xec:
          VG_(dmsg)("warning: L3 cache detected but ignored\n");
          break;
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top