Question

I'm using perf stat, and to better understand how the tool works I wrote a program that copies a file's contents into another file. I ran the program on a 750 MB file, and the stats are below:

   31691336329 L1-dcache-loads                                             
      44227451 L1-dcache-load-misses       
   15596746809 L1-dcache-stores                                            
      20575093 L1-dcache-store-misses                                      
      26542169 cache-references                                            
      13410669 cache-misses                 
   36859313200 cycles                            
   75952288765 instructions                      
      26542163 cache-references

What are the units of each number? That is, are they bits, bytes, or something else? Thanks in advance.

Solution

The unit is a single cache access for loads, stores, references and misses. Loads is the count of load instructions executed by the processor; the same goes for stores. Misses is the count of loads and stores that could not get their data from the cache at that level: the L1 data cache for the L1-dcache-* events, and the last-level cache (usually L2 or L3, depending on your platform) for the cache-* events.

31 691 336 329 L1-dcache-loads                                             
    44 227 451 L1-dcache-load-misses       
15 596 746 809 L1-dcache-stores                                            
    20 575 093 L1-dcache-store-misses                                      


    26 542 169 cache-references                                            
    13 410 669 cache-misses                 
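
Because each number is a plain event count, the miss counters are easier to judge as ratios of misses to accesses. The following is a minimal sketch in Python that uses only the counter values from the question; the helper name miss_ratio is made up for illustration and is not part of perf.

```python
# Counter values copied from the perf stat output in the question.
l1_loads = 31_691_336_329
l1_load_misses = 44_227_451
l1_stores = 15_596_746_809
l1_store_misses = 20_575_093
llc_refs = 26_542_169      # cache-references
llc_misses = 13_410_669    # cache-misses


def miss_ratio(misses, accesses):
    """Return the fraction of accesses that missed (hypothetical helper)."""
    return misses / accesses


l1_load_ratio = miss_ratio(l1_load_misses, l1_loads)
l1_store_ratio = miss_ratio(l1_store_misses, l1_stores)
llc_ratio = miss_ratio(llc_misses, llc_refs)

print(f"L1 load miss ratio:  {l1_load_ratio:.3%}")
print(f"L1 store miss ratio: {l1_store_ratio:.3%}")
print(f"LLC miss ratio:      {llc_ratio:.1%}")
```

With these inputs, well under 1% of L1 accesses miss, while roughly half of the last-level cache references miss.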

Cycles is the total count of CPU ticks during which the CPU executed your program. With a 3 GHz CPU there will be at most around 3 000 000 000 cycles per second. If the machine was busy, fewer cycles will be available to your program:

36 859 313 200 cycles                            

This is the total count of instructions executed by your program:

75 952 288 765 instructions                      

(I will use the G suffix as an abbreviation for billion.)

From the numbers we can conclude that 76G instructions were executed in 37G cycles (around 2 instructions per CPU tick, a rather high IPC). You gave no information about your CPU and its frequency, but assuming a 3 GHz CPU, the running time was about 12 seconds.
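
The IPC and running-time estimate can be reproduced with a few lines of Python. This is only a sketch: the 3 GHz clock is the same assumption as in the text, not a measured value, and the variable names are illustrative.

```python
# Counter values copied from the perf stat output in the question.
cycles = 36_859_313_200
instructions = 75_952_288_765

# Assumption: the CPU ran at a constant 3 GHz (the question does not say).
assumed_hz = 3.0e9

ipc = instructions / cycles      # instructions per cycle
seconds = cycles / assumed_hz    # CPU time implied by the cycle count

print(f"IPC: {ipc:.2f}")
print(f"Estimated CPU time at 3 GHz: {seconds:.1f} s")
```

A different real clock frequency would scale the time estimate proportionally but leave the IPC unchanged.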

Of the 76G instructions, 31G are loads (42%) and 15G are stores (21%), so only 37% of the instructions were non-memory instructions. I don't know the size of the memory references (byte loads and stores, 2-byte accesses, or wide SSE moves), but 31G load instructions looks far too high for a 750 MB file: the mean works out to about 0.02 bytes of file data per load, yet the shortest possible load or store is a single byte. So I think your program made several copies of the data, or the file was bigger.

750 MB in 12 seconds looks rather slow (about 60 MB/s), but this can be true if the first file was read from disk and the second file was written to disk without caching by the Linux kernel (do you have an fsync() call in your program? Are you profiling your CPU or your HDD?). With cached files and/or a RAM drive (tmpfs, a filesystem stored in RAM) the speed should be much higher.
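
The instruction mix and the bytes-per-load figure come from simple divisions, sketched below in Python. The file size is taken as 750 * 10^6 bytes, which is an assumption; using 750 MiB instead would change the result only slightly.

```python
# Counter values copied from the perf stat output in the question.
instructions = 75_952_288_765
l1_loads = 31_691_336_329
l1_stores = 15_596_746_809

# Assumption: "750 MB" means 750 * 10**6 bytes.
file_bytes = 750 * 10**6

load_share = l1_loads / instructions
store_share = l1_stores / instructions
other_share = 1.0 - load_share - store_share

bytes_per_load = file_bytes / l1_loads   # file bytes per load instruction
loads_per_byte = l1_loads / file_bytes   # load instructions per file byte

print(f"loads:  {load_share:.0%} of instructions")
print(f"stores: {store_share:.0%} of instructions")
print(f"other:  {other_share:.0%} of instructions")
print(f"file bytes per load: {bytes_per_load:.3f}")
print(f"loads per file byte: {loads_per_byte:.1f}")
```

Seen the other way around, the program issued roughly 42 load instructions for every byte of the file, which is why a plain single-pass copy does not explain the counts.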

Modern versions of perf do some simple calculations in perf stat and may also print units, as shown here: http://www.bnikolic.co.uk/blog/hpc-prof-events.html

perf stat -d  md5sum *

    578.920753 task-clock                #    0.995 CPUs utilized
           211 context-switches          #    0.000 M/sec
             4 CPU-migrations            #    0.000 M/sec
           212 page-faults               #    0.000 M/sec
 1,744,441,333 cycles                    #    3.013 GHz                     [20.22%]
 1,064,408,505 stalled-cycles-frontend   #   61.02% frontend cycles idle    [30.68%]
   104,014,063 stalled-cycles-backend    #    5.96% backend  cycles idle    [41.00%]
 2,401,954,846 instructions              #    1.38  insns per cycle
                                         #    0.44  stalled cycles per insn [51.18%]
    14,519,547 branches                  #   25.080 M/sec                   [61.21%]
       109,768 branch-misses             #    0.76% of all branches         [61.48%]
   266,601,318 L1-dcache-loads           #  460.514 M/sec                   [50.90%]
    13,539,746 L1-dcache-load-misses     #    5.08% of all L1-dcache hits   [50.21%]
             0 LLC-loads                 #    0.000 M/sec                   [39.19%]
(wrongevent?)0 LLC-load-misses           #    0.00% of all LL-cache hits    [ 9.63%]

   0.581869522 seconds time elapsed

UPDATE Apr 18, 2014

"Please explain why cache-references does not correlate with the L1-dcache numbers."

Cache-references DOES correlate with the L1-dcache numbers: cache-references is of the same order as L1-dcache-store-misses and L1-dcache-load-misses. Why are the numbers not equal? Because your CPU (Core i5-2320) has three levels of cache, L1, L2 and L3, and the LLC (last-level cache) is L3. A load or store instruction first tries to read or write its data in the L1 cache (L1-dcache-loads, L1-dcache-stores). If the address is not cached in L1, the request goes to L2 (L1-dcache-load-misses, L1-dcache-store-misses). In this run we have no exact data on how many requests were served by L2, because those counters are not in the default perf stat set, but we can assume that some loads and stores were served by L2 and some were not. The requests not served by L2 go to L3 (the LLC), and we see that there were 26M references to L3 (cache-references), of which half (13M) were L3 misses (cache-misses, served from main memory). The other half were L3 hits.

44M + 20M = 64M misses from L1 were passed to L2. 26M requests were passed from L2 to L3; these are the L2 misses. So 64M - 26M = 38 million requests were served by L2 (L2 hits).
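
The same bookkeeping can be written out in Python with the exact counter values. This is a rough sketch under a simplifying assumption: every request reaching L2 is an L1 data-cache miss, and every L3 reference is an L2 miss. Hardware prefetchers and instruction-fetch misses, which can also generate L2 and L3 traffic, are ignored, so the results are estimates rather than measured hit counts.

```python
# Counter values copied from the perf stat output in the question.
l1_load_misses = 44_227_451
l1_store_misses = 20_575_093
llc_refs = 26_542_169      # cache-references (requests reaching L3)
llc_misses = 13_410_669    # cache-misses (requests served by RAM)

# Assumption: L2 sees only L1 data-cache misses (no prefetch or
# instruction-fetch traffic), and L3 sees only L2 misses.
l1_misses = l1_load_misses + l1_store_misses   # requests passed to L2
l2_hits_estimate = l1_misses - llc_refs        # served by L2
llc_hits = llc_refs - llc_misses               # served by L3

print(f"L1 misses passed to L2: {l1_misses:,}")
print(f"Estimated L2 hits:      {l2_hits_estimate:,}")
print(f"L3 hits:                {llc_hits:,}")
print(f"Served by RAM:          {llc_misses:,}")
```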

Licensed under: CC-BY-SA with attribution