Updating large data matrix thread-safely: now using millions of mutexes?

https://stackoverflow.com/questions/8385679

28-10-2019
Question

I was revisiting some code I wrote a long time ago, and decided to rewrite it to make better use of threads (and of programming practice in general).

It is located here: https://github.com/buddhabrot/buddhabrot/blob/master/basic.c

It is an application that renders a buddhabrot fractal. For reasons outside the scope of this question it is hard to use memoization to optimize this, and if you profile it, over 99% of the time is spent in the innermost loop, which eventually does:

buddhabrot[col][row]++;

Multiple threads will execute this code. Since incrementing is not thread-safe, I put a mutex lock around this memory access, with a separate mutex for each addressable location in the buddhabrot memory.

Now, this is of course more efficient than using one lock (which would definitely make all the threads wait for each other), but it is less memory efficient, since each mutex takes up memory of its own. I am also wondering about other repercussions of having millions of mutexes in the pthreads implementations.

I now have two other strategies to consider:

1. Use a less dense set of mutex locks, one for each "region" in the map. A lock for [col/16][row/16], for instance, would only block a thread if it visits the same 16x16-pixel region as another one. The density of the locks could be dynamically adjusted. But as I was modeling this I wondered whether I am solving an existing problem that might even be implemented by kernels, and I also can't really find a way to do this without slowing things down. I also thought about "trees of mutexes", but all of this is just too slow inside this loop (to give an indication, after optimizing the order of some maths operations behind the compiler's back I could squeeze out about 30% more processor time). Is there a topic for this, and how do I look for more information on "mutex density planning"?
2. Copy the memory for each thread so I don't have to use a mutex around it at all. But this is even more memory-inefficient. It would solve the problem of having millions of mutexes without knowing the repercussions thereof.
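The first strategy can be sketched roughly as follows. This is only a minimal illustration, assuming pthreads and a fixed 16x16 region size; the array dimensions and the names `region_lock`, `init_region_locks` and `locked_increment` are hypothetical and do not come from the linked code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { WIDTH = 256, HEIGHT = 256, REGION = 16 };
enum { LOCK_COLS = WIDTH / REGION, LOCK_ROWS = HEIGHT / REGION };

static uint32_t buddhabrot[WIDTH][HEIGHT];
/* One mutex per 16x16 region instead of one per cell. */
static pthread_mutex_t region_lock[LOCK_COLS][LOCK_ROWS];

/* Call once, before any worker thread starts. */
static void init_region_locks(void) {
    for (int c = 0; c < LOCK_COLS; c++)
        for (int r = 0; r < LOCK_ROWS; r++)
            pthread_mutex_init(&region_lock[c][r], NULL);
}

/* Increment one cell while holding the lock of the region it belongs to. */
static void locked_increment(int col, int row) {
    assert(col >= 0 && col < WIDTH && row >= 0 && row < HEIGHT);
    pthread_mutex_t *m = &region_lock[col / REGION][row / REGION];
    pthread_mutex_lock(m);
    buddhabrot[col][row]++;
    pthread_mutex_unlock(m);
}
```

With these sizes the lock table shrinks from 65536 mutexes to 256, at the price of two threads occasionally waiting on each other when they hit the same region.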

So, is there anything else, anything better I could do?

Solution

You can use atomic increment functions such as InterlockedIncrement, available as compiler intrinsics through intrin.h on Windows platforms.

#include <intrin.h>

#pragma intrinsic(_InterlockedExchangeAdd, _InterlockedIncrement, _InterlockedDecrement, _InterlockedCompareExchange, _InterlockedExchange)
#define InterlockedExchangeAdd _InterlockedExchangeAdd
#define InterlockedIncrement _InterlockedIncrement
#define InterlockedDecrement _InterlockedDecrement
#define InterlockedCompareExchange _InterlockedCompareExchange
#define InterlockedExchange _InterlockedExchange

#pragma intrinsic(abs, fabs, labs, memcmp, memcpy, memset, strcat, strcmp, strcpy, strlen)
#pragma intrinsic(acos, cosh, pow, tanh, asin, fmod, sinh)
#pragma intrinsic(atan, exp, log10, sqrt, atan2, log, sin, tan, cos)

The increment is then atomic, so there is no need for millions of mutexes or for a global lock on your matrix.
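Since the question targets pthreads rather than Windows, the same idea could be expressed with C11 atomics on compilers that provide `<stdatomic.h>`. The following is a minimal sketch under that assumption; the array dimensions and the helper names `atomic_increment` and `read_count` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { WIDTH = 256, HEIGHT = 256 };

/* Static storage, so every counter starts at zero. */
static _Atomic uint32_t buddhabrot[WIDTH][HEIGHT];

/* Lock-free replacement for buddhabrot[col][row]++. */
static void atomic_increment(int col, int row) {
    assert(col >= 0 && col < WIDTH && row >= 0 && row < HEIGHT);
    /* Relaxed ordering is enough for a pure counter, provided the
       worker threads are joined before the final image is read. */
    atomic_fetch_add_explicit(&buddhabrot[col][row], 1, memory_order_relaxed);
}

/* Read one counter, e.g. when writing out the image. */
static uint32_t read_count(int col, int row) {
    return atomic_load_explicit(&buddhabrot[col][row], memory_order_relaxed);
}
```

Each increment still costs an atomic read-modify-write on the cache line holding that cell, so threads that hit neighbouring cells can still slow each other down, but no extra memory is spent on locks.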

Other tips

I think you should be able to partition the matrix so that each thread only updates one column at a time. That way the threads won't get in each other's way, and you don't have to lock.

Make a central synchronized queue of all the columns, let each thread go there to get a column number, then it will only update values in that column, and go to the queue for the next column until all are done.

Then the only contention will be on the central queue, which should be trivial compared to the rest.

Also, I guess each column has enough rows that you wouldn't get false sharing, which would slow you down.
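The column hand-out could look roughly like the sketch below. Note that it swaps the synchronized queue for a single atomic counter, which serves the same purpose when the work items are just consecutive column numbers. All names (`claim_column`, `column_worker`, `columns_done`) and the width are hypothetical, and the per-column work is reduced to marking the column as visited.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical width, chosen only for illustration. */
enum { WIDTH = 256 };

static atomic_int next_column;      /* static storage: starts at 0 */
static int columns_done[WIDTH];     /* stand-in for the real per-column work */

/* Hand out the next unclaimed column, or -1 when none are left. */
static int claim_column(void) {
    int col = atomic_fetch_add(&next_column, 1);
    return col < WIDTH ? col : -1;
}

/* Thread body with a pthread_create-compatible signature. */
static void *column_worker(void *unused) {
    (void)unused;
    int col;
    while ((col = claim_column()) != -1) {
        assert(col >= 0 && col < WIDTH);
        /* Only this thread owns `col`, so no lock is needed here. */
        columns_done[col]++;
    }
    return NULL;
}
```

Every column is claimed by exactly one thread, so the only shared state touched in the loop is the counter itself.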

Regards GJ

Your second design is the better choice for exactly the reasons you gave. For rendering a Buddhabrot you want to build up a large matrix of sums. You can avoid memory contention if you let each processor compute its own array and then add its results to a master array every minute or so. That's the only part that requires memory locks, and even that could be avoided by having each thread write to its own file. You do have multiple processors, right? If not, then adding threads will add no benefit.
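A minimal sketch of that per-thread approach, assuming pthreads: each thread increments a private array without any locking and only takes a single mutex when it folds its counts into the master array. The type `local_map`, the functions `local_increment` and `flush_to_master`, and the sizes are hypothetical names for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { WIDTH = 64, HEIGHT = 64 };

/* Shared result, touched only while master_lock is held. */
static uint64_t master[WIDTH][HEIGHT];
static pthread_mutex_t master_lock = PTHREAD_MUTEX_INITIALIZER;

/* Private per-thread copy of the map. */
typedef struct {
    uint32_t hits[WIDTH][HEIGHT];
} local_map;

/* Hot path: a plain increment on memory no other thread writes to. */
static void local_increment(local_map *local, int col, int row) {
    assert(col >= 0 && col < WIDTH && row >= 0 && row < HEIGHT);
    local->hits[col][row]++;
}

/* Cold path: add the private counts to the master array, then reset them. */
static void flush_to_master(local_map *local) {
    pthread_mutex_lock(&master_lock);
    for (int c = 0; c < WIDTH; c++)
        for (int r = 0; r < HEIGHT; r++)
            master[c][r] += local->hits[c][r];
    pthread_mutex_unlock(&master_lock);
    memset(local->hits, 0, sizeof local->hits);
}
```

The memory cost is one extra map per thread, but the lock is taken once per flush rather than once per increment.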

License: CC-BY-SA with attribution

Not affiliated with StackOverflow