C++ Using `.reserve()` to pad `std::vector`s as a way of protecting against multithreading cache invalidation and false sharing

https://stackoverflow.com/questions/8531741

18-03-2021
|

문제

I have a program with the general structure shown below. Basically, I have a vector of objects. Each object has member vectors, and one of those is a vector of structs that contain more vectors. By multithreading, the objects are operated on in parallel, doing computation that involves much accessing and modifying of member vector elements. One object is acessed by only one thread at a time, and is copied to that thread's stack for processing.

The problem is that the program fails to scale up to 16 cores. I suspect and am advised that the issue may be false sharing and/or cache invalidation. If this is true, it seems that the cause must be vectors allocating memory too close to each other, as it is my understanding that both problems are (in simple terms) caused by proximal memory addresses being accessed simultaneously by different processors. Does this reasoning make sense, is it likely that this could happen? If so, it seems that I can solve this problem by padding the member vectors using .reserve() to add extra capacity, leaving large spaces of empty memory between vector arrays. So, does all this make any sense? Am I totally out to lunch here?

struct str{
    vector <float> a;   vector <int> b;      vector <bool> c;  };

class objects{
    vector <str> a;     vector <int> b;      vector <float> c;  
    //more vectors, etc ...
    void DoWork();            //heavy use of vectors
};    

main(){
    vector <object> objs;
    vector <object> p_objs = &objs;

    //...make `thread_list` and `attr`
    for(int q=0; q<NUM_THREADS; q++)
        pthread_create(&thread_list[q], &attr, Consumer, p_objs );
    //...
}

void* Consumer(void* argument){
     vector <object>* p_objs = (vector <object>*) argument ;
     while(1){
         index = queued++;  //imagine queued is thread-safe global
         object obj = (*p_objs)[index]        
         obj.DoWork();
         (*p_objs)[index] = obj;
}

해결책

Well, the last vector copied in thread 0 is objs[0].c. The first vector copied in thread 1 is objs[1].a[0].a. So if their two blocks of allocated data happen to both occupy the same cache line (64 bytes, or whatever it actually is for that CPU), you'd have false sharing.

And of course the same is true of any two vectors involved, but just for the sake of a concrete example I have pretended that thread 0 runs first and does its allocation before thread 1 starts allocating, and that the allocator tends to make consecutive allocations adjacent.

reserve() might prevent the parts of that block that you're actually acting on, from occupying the same cache line. Another option would be per-thread memory allocation -- if those vectors' blocks are allocated from different pools then they can't possibly occupy the same line unless the pools do.

If you don't have per-thread allocators, the problem could be contention on the memory allocator, if DoWork reallocates the vectors a lot. Or it could be contention on any other shared resource used by DoWork. Basically, imagine that each thread spends 1/K of its time doing something that requires global exclusive access. Then it might appear to parallelize reasonably well up to a certain number J <= K, at which point acquiring the exclusive access significantly eats into the speed-up because cores are spending a significant proportion of time idle. Beyond K cores there's approximately no improvement at all with extra cores, because the shared resource cannot work any faster.

At the absurd end of this, imagine some work that spends 1/K of its time holding a global lock, and (K-1)/K of its time waiting on I/O. Then the problem appears to be embarrassingly parallel almost up to K threads (irrespective of the number of cores), at which point it stops dead.

So, don't focus on false sharing until you've ruled out true sharing ;-)

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow