Question

I have a Core i7 system with a 32 KB L1 cache, a 256 KB L2 cache, and an 8 MB L3 cache shared among 4 cores. I have written a program that executes parts A, B, and C sequentially. (A) Create a big int array 4 times the size of the L2 cache and access every 16th element (the cache line size is 64 B, and 16 * 4 B = 64 B) to make sure all my data is loaded into L2, noting down the access time of every accessed element. (B) Use clflush to manually evict data from multiple locations in the array, e.g. clflush(&bigarray[0]) ... clflush(&bigarray[1024]). (C) Access every 16th element of the big array again, including the lines that were evicted manually in (B).
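The sizes implied by this setup can be worked out in a few lines of C. This is only a sketch of the arithmetic; the constant names are made up for illustration and it assumes `sizeof(int)` is 4, as on typical x86 compilers.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical names; the numeric values come from the question. */
enum {
    L2_BYTES      = 256 * 1024,                /* L2 size */
    CACHE_LINE    = 64,                        /* cache line size in bytes */
    ARRAY_BYTES   = 4 * L2_BYTES,              /* array is 4 times the L2 size */
    ARRAY_INTS    = ARRAY_BYTES / sizeof(int), /* element count (assumes 4-byte int) */
    STRIDE_INTS   = CACHE_LINE / sizeof(int),  /* ints per cache line */
    LINES_TOUCHED = ARRAY_INTS / STRIDE_INTS   /* one access per cache line */
};

/* Print the derived layout so the numbers can be checked by eye. */
static void print_layout(void)
{
    printf("array: %d bytes, %d ints, stride %d ints, %d lines\n",
           (int)ARRAY_BYTES, (int)ARRAY_INTS,
           (int)STRIDE_INTS, (int)LINES_TOUCHED);
}
```

Note that the array is 1 MiB, which is larger than the 256 KB L2 but smaller than the 8 MB L3, so after pass (A) most of the lines would be expected to sit in L3 rather than L2.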

I put rdtsc() before and after the line that accesses the big array to measure the access time. I learned how to use clflush on i3/i7 machines from this link: clflush() in i3 or i7 processors

asm volatile ("cpuid; rdtsc" : "=a" (a), "=d" (d) : : "ebx", "ecx");

I get a higher access time after eviction for a single line, as expected. To my surprise, I do not see an increase in access time when I evict multiple locations.

Let me explain it another way:

Scenario 1: Accessing a single array element before and after clflush

  • Step 1: access a[x] and measure the access time
  • Step 2: use clflush(&a[x]) to evict it from the cache, then access a[x] again and measure the access time

Scenario 2: Accessing multiple array elements before and after clflush

  • Step 1: access each element a[i] of the array
  • Step 2: evict every element from the cache, then measure the access time of all elements a[i]:

    for all i {
        clflush(&a[i])
    }

In Scenario 2 I do not get a higher access time for the array elements after clflush, although I got the expected result in Scenario 1.
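For reference, Scenario 2 could be written roughly as follows. This is a sketch, not the asker's actual code: it assumes GCC or Clang on x86-64 and uses the compiler intrinsics from `<x86intrin.h>` instead of hand-written inline assembly. The helper names `time_load` and `flush_then_time` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, _mm_lfence, __rdtsc, __rdtscp */

/* Time a single load of *p in TSC ticks. The fences keep the load from
   drifting outside the timed window. */
static inline uint64_t time_load(const volatile int *p)
{
    unsigned aux;
    _mm_mfence();
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    (void)*p;                        /* the load being measured */
    uint64_t end = __rdtscp(&aux);   /* waits for earlier instructions */
    _mm_lfence();
    return end - start;
}

/* Scenario 2: flush one element per cache line, then time each reload
   in the same sequential order. ticks must hold n_ints / stride entries. */
static void flush_then_time(int *a, size_t n_ints, size_t stride,
                            uint64_t *ticks)
{
    for (size_t i = 0; i < n_ints; i += stride)
        _mm_clflush(&a[i]);
    _mm_mfence();                    /* let the flushes complete first */

    for (size_t i = 0, k = 0; i < n_ints; i += stride, k++)
        ticks[k] = time_load(&a[i]);
}
```

The second loop walks the array at a fixed stride, which is the steady access pattern the question describes.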

What is the reason? How can I overcome this and measure the correct access time after eviction? I have heard about hardware and software prefetching; are they influencing my result? If so, how can I eliminate their influence?

Solution

Try rerunning after disabling the HW prefetchers through the BIOS (or by any other means). You describe a very steady access stream, which a HW stream prefetcher would recognize immediately and fetch well in advance of your loads, making the access time exactly the same as a regular L2 lookup.
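If changing BIOS settings is not an option, a different technique that is sometimes used (not what this answer proposes) is to visit the cache lines in a shuffled order, so that there is no steady stride for a stream prefetcher to lock onto. The sketch below only builds such an order; `shuffled_line_order` is a hypothetical helper name. It is an assumption that this reduces prefetching on a given CPU, and it would not necessarily defeat an adjacent-line prefetcher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill order[0 .. n_lines-1] with a random permutation of the line
   indices using a Fisher-Yates shuffle. Element k of the timed pass
   would then be a[order[k] * stride] instead of a[k * stride]. */
static void shuffled_line_order(size_t *order, size_t n_lines, unsigned seed)
{
    assert(n_lines > 0);

    for (size_t k = 0; k < n_lines; k++)
        order[k] = k;

    srand(seed);
    for (size_t k = n_lines - 1; k > 0; k--) {
        /* rand() % (k + 1) is slightly biased, which does not matter here */
        size_t j = (size_t)rand() % (k + 1);
        size_t tmp = order[k];
        order[k] = order[j];
        order[j] = tmp;
    }
}
```

The order should be built before the flush and timing passes so that the shuffle itself is not part of what gets measured.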

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow