Question

Please see the following link, page 22 onwards:

http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

The above link suggests that if I have an object containing vectors/arrays like this:

class MyClass{
    public:
    double a[1000];
    double b[1000];
};

and the code below iterates through a std::vector of MyClass objects and performs some math on the member array b:

std::vector<MyClass> y;
y.populateVector(); // pseudocode: fill the vector somehow

for(auto x : y){
    //Iterate though x.b and do some math;
    for(int i=0; i<1000; i++){
        std::cout << x.b[i] << std::endl;
    }
}

when we retrieve each MyClass object, all the data from both of the arrays will be loaded into the cache. Is this true? I didn't think the data in a would be loaded into the cache, because the address for accessing b would be calculated and loaded directly.

I am trying to get an idea of how much of a MyClass object is loaded into the cache compared to the useful data required for processing.

I can understand the very first b element sharing a cache line with the very last a element, but I didn't think the whole object would get loaded into the L2/L3 cache just to process one part of it.

Solution 2

It depends on how the memory is organized on your system. If it just so happens that the backing storage for a and b sits close together in memory, it's possible that parts of both will be loaded, since the CPU will usually issue larger reads to fill a cache line in the hope that you use the extra data. If not, I see no reason why reading b would imply anything about a, aside from reading whatever pointers or metadata reside in the object itself.

What it does show is that laying classes out haphazardly can and will cause cache misses, purely because of the way they reside in memory.

The general rule for what gets loaded into the cache is: if the CPU issues a read and misses the cache, it loads a cache-aligned chunk from main memory (128 bytes in the presentation's examples).
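
As a concrete sketch of what "cache-aligned chunk" means, here is a hypothetical helper of mine (not from the answer) that computes the line an address belongs to, assuming 128-byte lines:

#include <cstdint>

// Base address of the cache line containing p, assuming 128-byte lines
// as in the presentation's examples.
std::uintptr_t line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) & ~static_cast<std::uintptr_t>(127);
}

Two reads are served by the same miss exactly when line_of gives the same value for both addresses, so a single miss also brings in everything else sharing that line.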

For your edited example: yes, these are colocated pieces of memory, and parts of a MAY be loaded when reads of b are issued, simply because of their location in memory.

For your example, each MyClass object consists of a contiguous region of 2000 * sizeof(double) bytes (most likely aligned). These objects are packed into one contiguous memory region owned by the vector. Accessing the b member of an object will incur a cache miss (if it's not already cached), and each read that misses the cache loads one cache-aligned block of memory. Depending on the alignment constraints and the cache-line size, it's possible that some entries of the a member will be loaded alongside b. It's also possible, given suitable padding and alignment, that none of the a members will be loaded into the cache at all (and there's no reason they should be, as they were never accessed).
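
To see the layout concretely, here is a minimal sketch (assuming 8-byte doubles and no padding, which holds for this class on typical platforms):

#include <cstddef>
#include <iostream>
#include <vector>

class MyClass {
public:
    double a[1000];
    double b[1000];
};

int main() {
    std::vector<MyClass> y(3);

    // Each object is one contiguous block: a first, then b right after it.
    std::cout << "sizeof(MyClass): " << sizeof(MyClass) << '\n';      // 16000
    std::cout << "offset of b:     " << offsetof(MyClass, b) << '\n'; // 8000

    // The objects themselves are packed back to back inside the vector.
    std::cout << "y[1] - y[0] in bytes: "
              << reinterpret_cast<char*>(&y[1]) - reinterpret_cast<char*>(&y[0])
              << '\n';                                                // 16000
}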

OTHER TIPS

Your declaration:

for(auto x : y) ...

declares x as a value instead of a reference, so each element of y is copied into the local variable x. The compiler might be able to optimize the copy away, but I wouldn't count on it.

If you write:

for(auto &x : y) ...

Then the loop will work on references to the objects in y. I'm assuming that's what you meant to do.
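
To make the difference concrete, here is a side-by-side sketch (the function names are mine, for illustration only):

#include <vector>

struct MyClass { double a[1000]; double b[1000]; };

// By value: every iteration copies all 16000 bytes of the element,
// including the a array this loop never uses.
double sum_by_value(const std::vector<MyClass>& y) {
    double s = 0;
    for (auto x : y) s += x.b[0];
    return s;
}

// By reference: no copy; each read goes straight to the vector's storage.
double sum_by_reference(const std::vector<MyClass>& y) {
    double s = 0;
    for (const auto& x : y) s += x.b[0];
    return s;
}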

In concrete terms, ignoring struct padding: the compiler will convert

double temp = y[i].b[j];

into something equivalent to

double temp = *reinterpret_cast<const double*>(
    reinterpret_cast<const char*>(y.data()) // byte address of y[0]
    + i * sizeof(MyClass)                   // start of y[i]
    + 1000 * sizeof(double)                 // skip over y[i].a
    + j * sizeof(double));                  // index into y[i].b

and it will load the cache-line-sized block containing that address into a cache line.

Then, as you iterate over more elements of y[i].b, many of them will already be in the cache.

Since the arrays contain 1000 elements each, they are much bigger than a cache line on a typical CPU: 1000 doubles take up 8000 bytes, whereas cache lines on the Sandy Bridge architecture (for example) are 64 bytes. Iterating over the arrays therefore makes effective use of every cache line loaded. You might waste part of a cache line where the last elements of x.a share a line with the start of x.b, but the effect should be small, and as the size of your arrays increases, the significance of these wasted loads approaches zero.
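
To put rough numbers on it (assuming 64-byte lines, as above):

#include <cstddef>
#include <iostream>

int main() {
    constexpr std::size_t kLineBytes  = 64;                     // assumed line size
    constexpr std::size_t kArrayBytes = 1000 * sizeof(double);  // 8000 bytes

    // Streaming through one b array touches roughly this many lines:
    std::cout << kArrayBytes / kLineBytes << " lines per array\n";  // 125

    // At worst, the line holding b[0] also holds the tail of a, and the line
    // holding b[999] holds the head of the next object's a: about two partly
    // wasted lines out of ~125, well under 2% of the traffic.
    std::cout << 2.0 / (kArrayBytes / kLineBytes) << " wasted fraction\n";
}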

The PlayStation presentation talks about objects that are comparable in size to a cache line. Those optimizations won't matter as much for objects as large as yours.

In the link you are referring to, the two arrays a and b are 4x4 matrices, i.e. 16 elements each. As this is about a video game, they are likely single-precision floats, so 16 floats take 64 bytes, while a cache line on that hardware is 128 bytes. There is therefore a significant chance that a large part of a is in the same cache line as b[0]; statistically, 50% of a will be. Reading b[0] will then load that part of a into the cache. If you manage to align the class/struct on 128 bytes, you are even guaranteed that a and b fit entirely in the same cache line.

Now, in your example you do not use 16 floats but 1000 doubles. That is 8000 bytes, much larger than a typical cache line. The last few elements of a may be in the same cache line as b[0], but the effect will be small.
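
If the shared-line effect does matter (as with the presentation's small matrices), the usual remedy, and the direction the presentation takes, is to split hot and cold data into separate arrays. A minimal sketch; Scene, Matrix4x4, and sum_first_elements are hypothetical names of mine:

#include <vector>

struct Matrix4x4 { float m[16]; };   // 64 bytes, as in the slides

// Hot/cold split: parallel vectors instead of one vector of objects holding
// both matrices. A loop over b now streams through memory containing
// nothing but b data, so no part of a is dragged into the cache.
struct Scene {
    std::vector<Matrix4x4> a;   // cold in this loop
    std::vector<Matrix4x4> b;   // hot in this loop
};

float sum_first_elements(const Scene& s) {
    float total = 0;
    for (const Matrix4x4& m : s.b)   // sequential, line-friendly reads
        total += m.m[0];
    return total;
}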

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow