I want to learn how cache optimization is done, and a friend suggested Cachegrind as a useful tool for this.
Valgrind, being a CPU simulator, assumes a two-level cache when using Cachegrind, as mentioned here:
Cachegrind simulates how your program interacts with a machine's cache
hierarchy and (optionally) branch predictor. It simulates a machine
with independent first-level instruction and data caches (I1 and D1),
backed by a unified second-level cache (L2). This exactly matches the
configuration of many modern machines.
The next paragraph continues:
However, some modern machines have three or four levels of cache. For
these machines (in the cases where Cachegrind can auto-detect the
cache configuration) Cachegrind simulates the first-level and
last-level caches. The reason for this choice is that the last-level
cache has the most influence on runtime, as it masks accesses to main
memory.
However, when I tried running Cachegrind on my simple matrix-matrix multiplication code, I got the following output:
==6556== Cachegrind, a cache and branch-prediction profiler
==6556== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==6556== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==6556== Command: ./a.out
==6556==
--6556-- warning: L3 cache detected but ignored
==6556==
==6556== I refs: 50,986,869
==6556== I1 misses: 1,146
==6556== L2i misses: 1,137
==6556== I1 miss rate: 0.00%
==6556== L2i miss rate: 0.00%
==6556==
==6556== D refs: 20,232,408 (18,893,241 rd + 1,339,167 wr)
==6556== D1 misses: 150,194 ( 144,869 rd + 5,325 wr)
==6556== L2d misses: 10,451 ( 5,506 rd + 4,945 wr)
==6556== D1 miss rate: 0.7% ( 0.7% + 0.3% )
==6556== L2d miss rate: 0.0% ( 0.0% + 0.3% )
==6556==
==6556== L2 refs: 151,340 ( 146,015 rd + 5,325 wr)
==6556== L2 misses: 11,588 ( 6,643 rd + 4,945 wr)
==6556== L2 miss rate: 0.0% ( 0.0% + 0.3% )
According to the documentation, the L1 and L3 caches should have been simulated, but the output says the L3 cache is being ignored. Why is that?
Also, does Cachegrind assume fixed L1 and last-level cache sizes, or does it use the L1 and last-level cache sizes of the CPU it is currently running on?
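For what it's worth, I see that Cachegrind accepts explicit cache parameters on the command line, so I assume an invocation like the one below would override whatever it auto-detects (the sizes, associativities, and line sizes here are made-up example values, given as size,associativity,line size in bytes; this Valgrind 3.6 version calls the last-level option --L2):

```shell
valgrind --tool=cachegrind \
         --I1=32768,8,64 \
         --D1=32768,8,64 \
         --L2=8388608,16,64 \
         ./a.out
```

Is this the intended way to make the simulated configuration match a specific machine?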