Why is accessing elements of a huge dynamically allocated structure a lot slower than a small dynamically allocated array in C++?

softwareengineering.stackexchange https://softwareengineering.stackexchange.com/questions/360417

22-01-2021

Question

I am doing C++ programming on Ubuntu and really care about the efficiency of my code. The computer I work with has 32 GB of RAM, and the code is compiled with the C++11 option. I noticed that for a very large dynamically allocated array such as my_array_1 in the following code, accessing elements is significantly slower than for a relatively small array such as my_array_2. I tested this with structures, but I suspect it is true for any type of large variable (??). See this code as an example:

#define NT 100000

typedef struct {
  float ind_1[4096];
  float ind_2[4096];
  int n;
} ind_vec; // 32 KB

// .....

ind_vec *my_array_1; // a huge struct
int *my_array_2; // a small vector

my_array_1 = new ind_vec[NT]; // about 3 GB
my_array_2 = new int[NT]; // about 400 KB

for(int i = 0; i<100; i++){ // This loop is slow!
  // I don't involve ind_1 and ind_2 for now
  my_array_1[i].n = 1; 
}

for(int j = 0; j<100; j++){ // This loop is fast!
  my_array_2[j] = 1;
}

delete[] my_array_1;
delete[] my_array_2;

As I indicated in the code, the first loop is much slower than the second one (on the order of 1000 times). Exact timing of each loop was done through a simple function using gettimeofday (not shown here).
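(For reference, a minimal sketch of the kind of gettimeofday-based timer described above; the helper name elapsed_seconds is hypothetical and not the asker's actual code.)

#include <sys/time.h>

// Hypothetical helper: wall-clock seconds elapsed since `start`.
double elapsed_seconds(const struct timeval &start) {
    struct timeval now;
    gettimeofday(&now, nullptr);
    return (now.tv_sec - start.tv_sec) + (now.tv_usec - start.tv_usec) / 1e6;
}

// Usage:
//   struct timeval t0;
//   gettimeofday(&t0, nullptr);
//   /* loop under test */
//   double secs = elapsed_seconds(t0);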

Q1) I'd assume that these two loops effectively do the same job (from the computer's perspective) through the same approach. Is the performance difference because my_array_2 is allocated on the heap while my_array_1 is perhaps allocated somewhere else (I don't know where)?

Q2) Is there any workaround here?


Solution

You have some very good answers on the topic here

Generally, your struct is probably too big for the CPU cache, so parts of it likely end up in the L2 cache or in main RAM, both of which are significantly slower than the L1 cache; hence the performance issues. You might want to do some profiling and find out exactly what is going on. If you do, I would very much like to read the results.
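One low-tech way to do that profiling on Linux (a sketch of my own, not part of the original answer) is to read the process's minor page-fault counter with getrusage before and after each loop; if the slow loop's count jumps by roughly one fault per element touched, demand paging is a big part of the cost:

#include <sys/resource.h>
#include <cstdio>

// Number of minor (soft) page faults taken by this process so far.
static long minor_faults() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

// Usage around the suspect loop:
//   long before = minor_faults();
//   for (int i = 0; i < 100; i++) my_array_1[i].n = 1;
//   printf("minor faults during loop: %ld\n", minor_faults() - before);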

If you are striving for performance, ask yourself why you need a struct with two arrays of that size. Could you achieve similar performance if you just had pointers and allocated the arrays dynamically as needed? Could you simply drop the struct and handle the members independently? I know the last approach is very ugly, but when performance is the ultimate requirement, some sacrifices in code readability must be made.

OTHER TIPS

As mentioned in the comments, it's impossible to say for sure without seeing what the compiler outputs and how the OS handles memory allocation. But we can take some educated guesses. In modern OSes, when you allocate memory, it may be reserved but not yet wired in. That is, the OS checks whether there is enough room left in the virtual address space to accommodate the request, and if there is, it just reserves the range; it takes no action yet to make the memory available for use. Only when you touch a byte of that memory does the OS actually wire it in: it looks for free physical memory to back that block of virtual memory, marks it as in use, and then your code can start using it. That check-and-mark work takes time.

In the case of the smaller allocation, the 100 ints you touch fit within a page or two of virtual memory that needs to be brought into physical memory. But for the larger allocation, each iteration strides roughly 32 KB further into the block, so nearly every write lands on a page that still has to be checked and wired in, and that slows down the writing.
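A simple experiment to test this guess (my own sketch, not part of the original answer) is to run the same loop twice over the freshly allocated struct array and time each pass separately; if the first pass is far slower than the second, the cost is the first-touch page faulting rather than the writes themselves:

#include <sys/time.h>
#include <cstdio>

#define NT 100000

typedef struct {
  float ind_1[4096];
  float ind_2[4096];
  int n;
} ind_vec;

static double now_seconds() {
  struct timeval tv;
  gettimeofday(&tv, nullptr);
  return tv.tv_sec + tv.tv_usec / 1e6;
}

int main() {
  ind_vec *a = new ind_vec[NT];

  double t0 = now_seconds();
  for (int i = 0; i < 100; i++) a[i].n = 1;   // first touch: pages must be wired in
  double t1 = now_seconds();
  for (int i = 0; i < 100; i++) a[i].n = 2;   // same pages, now resident
  double t2 = now_seconds();

  printf("first pass:  %f s\n", t1 - t0);
  printf("second pass: %f s\n", t2 - t1);
  delete[] a;
}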

Looking at your structure, one way around it would be to not create such a large structure. If those 2 arrays could just be pointers that are allocated when needed, the structure would be much smaller and wouldn't suffer as badly from this problem. (Though it would still be worse than the array of ints case.)

As others have said, the main performance difference is likely due to the different memory page profiles. The 100 ints you write fit within a single page of virtual memory, while the 100 structs you write span roughly 100 pages, since each struct is about 32 KB and almost every iteration lands on a fresh page that has to be mapped to physical memory (I'm assuming 4 KiB pages and 32-bit ints). That pattern fits with what you're seeing.

Your second question is: what do you do about it? As is often the case with these things, the answer is that it depends.

Now, if you're expecting that iterating over the n field is a common operation then you're in luck. The way you've structured your data is what's known as an Array of Structs (AoS). As the name implies, you've created an array of structs. There is an alternative that's used in applications that process large regular arrays like yours and that is, cunningly, called Struct of Arrays (SoA).

So, for your example, you'd change your code as follows:

#define NT 100000

typedef struct {
    float ind_1[4096][NT];
    float ind_2[4096][NT];
    int n[NT];
} ind_vec_SoA;

ind_vec_SoA *my_array_soa = new ind_vec_SoA;

for(int i = 0; i<100; i++){ // This loop is fast now (hopefully)!
    my_array_soa->n[i] = 1;
}

With this approach, you're likely to get memory locality similar to the individual int array as the int n[NT] array is likely to be allocated contiguously. If you process the n fields, it'll have nice performance characteristics. If, however, you process multiple fields in a struct at a time, the performance characteristics will be much worse than your original design.
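To make that trade-off concrete, here is an illustrative sketch (not from the original answer) of summing ind_1 for a single element i under both layouts; the SoA form above turns a contiguous scan into a scan with a stride of NT * sizeof(float), roughly 400 KB:

// AoS: the 4096 floats of element i sit next to each other, so this streams nicely.
float sum_aos = 0.0f;
for (int k = 0; k < 4096; k++) {
    sum_aos += my_array_1[i].ind_1[k];
}

// SoA (with the [4096][NT] layout above): consecutive k values are
// NT * sizeof(float) bytes apart, so nearly every access hits a cold
// cache line, and possibly a different page.
float sum_soa = 0.0f;
for (int k = 0; k < 4096; k++) {
    sum_soa += my_array_soa->ind_1[k][i];
}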

It's also worth pointing out that this isn't pretty as far as encapsulation goes (at least not in C++), but it is a fairly common paradigm in games and HPC applications, so much so that languages designed for those fields, e.g. Jai and Chapel, cater for this transformation more directly.

The "n" fields in my_array_1 are not subsequent in memory, and then they are mapped to different memory cache blocks.

Instead my_array_2 elements share the same cache blocks.

Short answer: Cache locality.
Processors use a small amount of fast memory (the cache) to store the values they are working with. If a value is not in the cache, they have to go out to RAM, which is very, very slow. Processors usually try to prefetch data ahead of time to mitigate that cost; for example, if you iterate over a simple int a[1000000000], the access pattern is easy to predict, so it stays fast. But since each element of my_array_1 is roughly the size of the L1 cache, you risk a cache miss on every iteration of your for loop. Not to mention that the L2 cache is usually around 256 KB, which is about sizeof(ind_vec) * 8, so after the first 8 iterations you have a 100% cache miss rate.

A quick solution is to dynamically allocate float arrays in ind_vec:

typedef struct {
  float* ind_1 = new float[4096]; // remember to delete[] these when done
  float* ind_2 = new float[4096];
  int n;
} ind_vec;

This way your program only loads a couple of pointers and an int per element (about 20 bytes on a 64-bit system), which is a lot more manageable.
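Note that with raw pointers initialized that way you become responsible for freeing the arrays yourself. A variation of the same idea using std::vector (my own sketch, not the answer's code) keeps the per-element footprint just as small while handling the cleanup automatically:

#include <vector>

struct ind_vec {
    std::vector<float> ind_1 = std::vector<float>(4096);
    std::vector<float> ind_2 = std::vector<float>(4096);
    int n = 0;
};
// sizeof(ind_vec) is now just two small vector headers plus an int,
// so a loop over the n fields stays cache- and page-friendly.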

Well, of course the first loop is slower. Without even getting that deep into page faults and how the OS pages in memory on access, the loop stride is sizeof(ind_vec), roughly 32 KB (32,772 bytes), to get from one n field to the next.

When loading in pages (say 4 KB) down to cache lines (say 64 bytes) down to a register (say 64 bits), the machine typically fetches all the surrounding field data in ind_vec only to access the n field (say 4 bytes). It spends all that time paging in 4 kilobytes of memory and moving 64-byte cache lines down the hierarchy to a general-purpose register, only to process 4 of the 64 bytes in the cache line and 4 of the 4 kilobytes in the page. You're basically making the machine waste enormous amounts of time moving memory that isn't relevant to your loop (you access only 4 of the roughly 32,000 bytes of each ind_vec per iteration) down the hierarchy. That's going to obliterate spatial locality.
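A quick way to put numbers on that waste (a small sketch of my own, reusing the struct definition from the question):

#include <cstdio>

typedef struct {
  float ind_1[4096];
  float ind_2[4096];
  int n;
} ind_vec;

int main() {
  // Each iteration of the first loop writes 4 bytes (the int n) out of the whole struct.
  printf("sizeof(ind_vec)          = %zu bytes\n", sizeof(ind_vec)); // ~32 KB
  printf("bytes used per iteration = %zu\n", sizeof(int));           // 4
  printf("fraction actually used   = %g\n",
         (double)sizeof(int) / (double)sizeof(ind_vec));
}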

Meanwhile with your second loop, you are generally accessing all data loaded into a page and into a cache line prior to eviction. The second loop goes through a densely packed array of integers where you aren't causing the hardware and OS to waste time moving memory down the hierarchy only to process a minuscule fraction of it prior to eviction.

Memory Hierarchy

The hardware has a memory hierarchy ranging from the smallest and fastest (registers) up to the largest and slowest (disk, with DRAM just below it).

To avoid hitting the big but slow memory too often, the machine loads memory from the slow levels in contiguous chunks (e.g. 4 kilobytes for a page, 64 bytes for a cache line). It grabs memory from the slow levels by the handful, so to speak, whenever you do anything with a specific address whose surroundings aren't already paged in or sitting in a cache line. To get the maximum benefit from this, you want to write code so that when the hardware and OS grab data from slow memory in those big contiguous chunks and move it down the hierarchy, you don't waste that expensive process by touching only a few bytes of each chunk before jumping off to some distant address elsewhere, which is exactly what your first loop does.

M&M Analogy


The code you have originally with the first loop is making the hardware jump through hoops only to waste most of the effort. It's like using a giant spoon to dig into a bowl of M&Ms only to then pick out and eat only the green M&Ms and then toss the remaining M&Ms aside. That's very inefficient as opposed to eating all colors of M&Ms in a spoonful at once or having a bowl consisting only of green M&Ms, the ones you are actually going to eat, so that you can eat entire spoonfuls of green M&Ms at once.

Meanwhile the second loop is like eating from a bowl of only green M&Ms. With that second loop, you can eat entire spoonfuls of green M&Ms at once since you're accessing all the data in the array and not skipping massive sections with an epic 8 kilobyte stride.

Hot/Cold Field Splitting

Q2) Is there any workaround here?

The most direct answer to the conceptual problem in your example case is hot/cold field splitting. Don't put a boatload of data you don't access in critical paths in the same structure or class as data you access frequently. Store that on the side in parallel, like so:

#include <vector>
using std::vector;

struct ind_vec_cold {
    // Cold fields, not accessed frequently.
    float ind_1[4096];
    float ind_2[4096];
};

struct ind_vec_hot {
    // Hot fields, accessed frequently.
    int n;
};

struct ind_vec {
    ind_vec(int n): hot(n), cold(n) {}
    vector<ind_vec_hot> hot;
    vector<ind_vec_cold> cold;
};

ind_vec data(NT);
for(int i = 0; i<100; i++){ // Not slow anymore!
    data.hot[i].n = 1; 
}


Using the M&M analogy above, this hot/cold field splitting approach effectively achieves the effect of having a bowl of only green M&Ms so that we can now consume them quickly by the spoonful. We use a different bowl to store the other M&M colors. Previously you had a mixed bowl of M&Ms and were making the machine grab M&Ms by the spoonful only to pick out the few green M&Ms in that spoon and then toss the rest aside.

Licensed under: CC-BY-SA with attribution