Question

I define an algorithm to be best if it minimizes the total run time on commodity hardware (e.g. normal desktop and server PCs).

I have sets A and B. I also have a function f that is invoked as f(a, b, n), where a is an item in A, b is an item in B, and n is a natural number. f(a, b, n) returns a positive real number.

These properties are true:

  • f(a,b,n) has a randomized component in it, so invoking the function twice will likely give two different answers (albeit close to each other). Larger values of n only reduce the estimation error of f.
  • f(a,b,n) takes twice as long to compute as f(a,b,n/2).
  • f(a,b,n) = (f(a,b,n/2) + f(a,b,n/2))/2
  • Computing f(a_1,b_1,n) is independent of computing f(a_2,b_2,n), where a_1,a_2 are distinct elements of A and b_1,b_2 are distinct elements of B.
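
To make the later code sketches concrete, here is a hypothetical stand-in for f: a Monte Carlo estimator that averages n noisy samples. Everything about it (the double element types, the a*b payload, the uniform noise) is invented for illustration, since the real f is not disclosed:

#include <stdlib.h>

/* Hypothetical f: averages n noisy samples of some underlying
 * quantity (here arbitrarily a*b). Its cost is linear in n, larger
 * n reduces the estimation error, and averaging two runs at n/2
 * matches one run at n, as required by the properties above. */
double f(double a, double b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double noise = (double)rand() / RAND_MAX - 0.5;  /* randomized component */
        sum += a * b + noise;
    }
    return sum / n;
}

Note that rand() is not thread-safe; a multithreaded version of this stand-in would need rand_r or a per-thread generator.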

My goal is to compute the value of answer as follows:

count = 0
output = 0

for a in A:
    for b in B:
        output += f(a,b,n)
        count += 1

answer = output / count

return answer
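
In C, assuming A and B are plain arrays of doubles (an assumption; their real structure is not specified) and f is the hypothetical stand-in above, the serial computation is:

#include <stddef.h>

double serial_average(const double *A, size_t lenA,
                      const double *B, size_t lenB, int n) {
    double output = 0.0;
    for (size_t i = 0; i < lenA; i++)
        for (size_t j = 0; j < lenB; j++)
            output += f(A[i], B[j], n);
    return output / ((double)lenA * lenB);  /* count = |A| * |B| */
}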

But the code above is highly serialized. What I wish to do is parallelize it so as to minimize the total number of seconds needed to obtain answer.

Just to reiterate: I am running this application on a single computer, which is why I am considering a multi-threaded application. I do not wish to distribute it over multiple machines. All I want is to run it really fast over the many cores that I have on a single computer.

Solution

This answer builds on top of Erik Eidt's answer.

There are three methods by which you can slice the data: subranges of A, subranges of B, and subranges of N, the latter of which you've already done with your own answer.

In addition to just slicing one of the sets, it is possible to slice on both A and B.

Before I explain, I must first apologize for not using mathematical notation: this site does not support MathJax, so it is not possible to write notation that contains more than one level of subscripts.

The name for slicing on more than one subset (or dimension of data) is Loop Tiling.

To simplify things, let's slice A as follows: (A1...A10), (A11...A20), (A21...A30), ..., and likewise B: (B1...B10), (B11...B20), (B21...B30), ...

The choice of slicing A and B into groups of 10 is arbitrary. You should experiment with different grouping sizes to find a more optimal combination.

The pseudocode is then:

for each ten consecutive items taken from A, namely A(10*j+1...10*j+10)
    for each ten consecutive items taken from B, namely B(10*k+1...10*k+10)
        for each item "a" in the subrange A(10*j+1...10*j+10)
            for each item "b" in the subrange B(10*k+1...10*k+10)
                for i = 1...n
                    process f(a, b, ...)

Some details are omitted in the pseudocode in order to highlight the specifics of the "loop tiling" technique.

I omitted the parallelization aspect, but basically one can pick one of the "for" loops and convert it into a "parallel for" to maximize multicore utilization.
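
As a concrete sketch, here is the tiling in C with OpenMP, converting the pair of tile loops into a "parallel for". It reuses the array-of-doubles assumption from the question and keeps f opaque, calling it with the full n instead of unrolling the innermost loop:

#include <stddef.h>

#define TILE 10  /* arbitrary, as noted above; experiment with other sizes */

double tiled_average(const double *A, size_t lenA,
                     const double *B, size_t lenB, int n) {
    double output = 0.0;
    /* Parallelize over pairs of tiles; each thread works through one
       TILE x TILE block of the A x B matrix at a time. */
    #pragma omp parallel for collapse(2) reduction(+:output) schedule(dynamic)
    for (size_t jt = 0; jt < lenA; jt += TILE)
        for (size_t kt = 0; kt < lenB; kt += TILE) {
            size_t jmax = jt + TILE < lenA ? jt + TILE : lenA;
            size_t kmax = kt + TILE < lenB ? kt + TILE : lenB;
            for (size_t j = jt; j < jmax; j++)
                for (size_t k = kt; k < kmax; k++)
                    output += f(A[j], B[k], n);
        }
    return output / ((double)lenA * lenB);
}

Compile with -fopenmp (gcc/clang). schedule(dynamic) hands tiles out as threads free up, which helps when some tiles finish faster than others.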


Some more general advice.

As explained in the Wikipedia article on loop tiling, there is no simple rule for choosing the optimal subrange sizes for A and B. Real-world computer and software performance is influenced by too many factors. Instead, one has to try different values and see which combination runs faster. Such experimentation also reveals which performance factor dominates; this can only be discovered empirically, not predicted by theory alone.

Since you are using C, you have several choices:

  • OpenMP
  • MPI
  • Manually written multithreaded programming using pthreads

I see that you have pthreads in the tags. However, for this type of computation, it is easier to achieve good performance with OpenMP.
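
For instance, the question's doubly-nested loop needs only a single directive in OpenMP. This uses the same hypothetical array layout as before, and it is only correct if f itself is thread-safe (in particular, its random number generator must not share unsynchronized state between threads):

#include <stddef.h>

double omp_average(const double *A, size_t lenA,
                   const double *B, size_t lenB, int n) {
    double output = 0.0;
    /* Each thread accumulates a private copy of output; the
       reduction clause combines the copies safely at the end. */
    #pragma omp parallel for collapse(2) reduction(+:output)
    for (size_t i = 0; i < lenA; i++)
        for (size_t j = 0; j < lenB; j++)
            output += f(A[i], B[j], n);
    return output / ((double)lenA * lenB);
}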

You should already know about thread safety. If you do not, it is unlikely that any multithreaded program you write will give a correct answer in the first place. So it is very important that you either understand thread safety or avoid multiple threads altogether.

One way to increase multi-core CPU utilization without multithreading is to use multiple OS processes: open several terminal windows, launch the program in each, and instruct each instance to save its output to a different file. When all of the instances finish, combine the files into the final result.
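
If launching terminals by hand gets tedious, a variation on the same idea is to fork the worker processes from a single parent and collect the partial results through a pipe instead of files. A rough POSIX sketch, reusing the hypothetical serial_average from the question and omitting all error handling:

#include <stddef.h>
#include <unistd.h>
#include <sys/wait.h>

/* Each of P children computes the full average using n/P samples; the
 * parent averages the P partial answers, which is valid because of the
 * averaging property of f stated in the question. */
double multiprocess_average(const double *A, size_t lenA,
                            const double *B, size_t lenB,
                            int n, int P) {
    int fd[2];
    pipe(fd);
    for (int p = 0; p < P; p++) {
        if (fork() == 0) {                        /* child process */
            double part = serial_average(A, lenA, B, lenB, n / P);
            write(fd[1], &part, sizeof part);
            _exit(0);
        }
    }
    double total = 0.0, part;
    for (int p = 0; p < P; p++) {                 /* parent collects */
        read(fd[0], &part, sizeof part);
        total += part;
    }
    while (wait(NULL) > 0) {}                     /* reap the children */
    return total / P;
}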

OTHER TIPS

There's not a lot to work with here. You are not divulging f, and you are not asking to parallelize the internals of f, just the doubly-nested loop that invokes it. While there is some relationship between f(,,n) and f(,,n-1), I don't see how to take advantage of it because of the undisclosed randomizing component.

(I presume this is your intent, but just to be clear, there may be a better solution if we could understand the randomizing component, since repeating that over and over looks like where all the work really is, and doing something different instead might be most effective.)

So, the only thing you can do is slice the data to keep all the cores busy.

There are three methods by which you can slice the data: subranges of A, subranges of B, and subranges of N, the latter of which you've already done with your own answer.

You also haven't divulged the structure of A or B, except that they are obviously collections, or at least generators.

If they are collections manifested in memory (arrays, lists, etc.) and of significant size, then having each core iterate over A and B could thrash the cache. Unless, that is, the cores cooperate and happen to be working on the same range of A and B at the same time; in that case they'll actually get a boost from each other!

But if they get significantly out of sync with each other (say, because conditional logic in f does not evaluate the same way on every core), then they'll be fighting each other for the cache.

(An analogy is copying large folders/directories on your hard drive. If you start another copy of different folders at the same time, both copies will more than likely slow to a crawl, and together take 10x the time of running the two copies serially.)

So, to mitigate the cache fighting, if it is indeed a problem, you could limit each CPU to a portion of the A x B matrix that fits in the cache, have all the CPUs work on that limited set until they are all done, and only then move on to another subset of the A x B matrix. If one CPU finishes first, I probably would not even give it new work, presuming that (a) cache thrashing is a real problem in your domain, and (b) all the CPUs finish their A x B subset more or less at the same time. Under those assumptions, we are probably better off running all CPUs to completion on one subset before embarking on the next.
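
If you want a starting point for that subset size rather than pure trial and error, pick it so the elements touched fit in a fraction of the cache. A back-of-the-envelope helper; the 256 KiB figure is an assumed L2 size and should be measured or tuned on the actual machine:

#include <stddef.h>

/* Choose a square tile side T so that T elements of A plus T elements
 * of B occupy about half the cache, leaving headroom for f's own data. */
size_t tile_side(size_t elem_size_a, size_t elem_size_b) {
    size_t cache_bytes = 256 * 1024;   /* assumed L2 size per core */
    return (cache_bytes / 2) / (elem_size_a + elem_size_b);
}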

Of course, you also want to spawn as few threads as you can, because each thread represents overhead as well; that is a benefit of the slicing in your answer. But it is possible that out-of-sync threads thrash the cache badly enough that fixing it would be worth spawning additional threads.

On the other hand, spawning exactly as many threads as cores, with an algorithm that works incrementally on well-defined subranges of the A x B matrix and waits for all the other cores to acknowledge completion before starting on the next subrange, may provide the best of all solutions. Each thread announces completion of its subset and then suspends itself, waiting for notification that all threads have finished.

So each core would march through the same A x B subset using your notion of n/C iterations, and then all the cores would move on to the next A x B subset.


In fact, even using subranges of A x B in a single-threaded algorithm might improve performance over ranging over all of A x B numerous times, which is entirely possible if the memory touched in the A x B matrix doesn't fit in the cache. Each pass through all of A x B must bring the entirety of both data structures into the cache again (perhaps even repeatedly within a single pass), whereas a single thread running on a manageable subset brings each subset into the cache only once for all n iterations. So you might start with a single-threaded version of the matrix subsetting, and then add the parallel synchronization.

Here is a solution to the much more general problem of parallelizing any loop cleanly.

Just let each thread pull items (pairs of a,b in your case) one after another and compute the partial result until all are consumed.

Main thread:

output = 0
iter = iterator_over(A,B)

// start threads and wait until done

answer = output / size(A) / size(B)
return answer

Each thread:

res = 0
while true:
   synchronized:
       if !iter.hasNext():
           break
       a,b = iter.next()

   res += f(a, b, n)

synchronized:
    output += res

For optimal performance, the number of threads should equal the number of CPU cores.
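
A C translation of this pattern using pthreads (matching the question's tag) might look as follows; the shared counter plays the role of the iterator, a mutex stands in for "synchronized", and f, A, and B are the hypothetical definitions from earlier:

#include <pthread.h>
#include <stddef.h>

static const double *gA, *gB;     /* set by the main thread */
static size_t gLenA, gLenB;
static int gN;
static size_t gNext;              /* next flat index into A x B */
static double gOutput;
static pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *unused) {
    (void)unused;
    double res = 0.0;
    for (;;) {
        /* "synchronized": claim the next (a, b) pair, if any remain. */
        pthread_mutex_lock(&gLock);
        size_t idx = gNext;
        if (idx < gLenA * gLenB) gNext++;
        pthread_mutex_unlock(&gLock);
        if (idx >= gLenA * gLenB) break;

        res += f(gA[idx / gLenB], gB[idx % gLenB], gN);
    }
    /* "synchronized": fold the partial result into the total. */
    pthread_mutex_lock(&gLock);
    gOutput += res;
    pthread_mutex_unlock(&gLock);
    return NULL;
}

The main thread fills in the globals, spawns one worker per core with pthread_create, joins them with pthread_join, and finally divides gOutput by size(A) * size(B).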

Say that I have C CPU cores, and that I wish to run this in parallel on all C of them. Here is my current solution:

First thread:

THREAD_1
count = 0
output = 0

for a in A:
    for b in B:
        output += f(a,b,n/C)
        count += 1

thread_1_answer = output / count

return thread_1_answer

Second thread:

THREAD_2
count = 0
output = 0

for a in A:
    for b in B:
        output += f(a,b,n/C)
        count += 1

thread_2_answer = output / count

return thread_2_answer

...

Cth thread:

THREAD_C
count = 0
output = 0

for a in A:
    for b in B:
        output += f(a,b,n/C)
        count += 1

thread_C_answer = output / count

return thread_C_answer

Then finally I define the final answer as follows:

answer = (thread_1_answer + thread_2_answer + ... + thread_C_answer)/C
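
For what it is worth, here is this scheme in C with pthreads. It is pleasantly simple because the threads share nothing while running (f must still be internally thread-safe); A, B, and f are again the hypothetical definitions from earlier in the post:

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define C 8   /* number of cores; in real code query it at runtime */

static const double *A, *B;       /* set before the threads start */
static size_t lenA, lenB;
static int n;
static double thread_answer[C];

static void *run_slice(void *arg) {
    size_t t = (size_t)(uintptr_t)arg;
    double output = 0.0;
    for (size_t i = 0; i < lenA; i++)
        for (size_t j = 0; j < lenB; j++)
            output += f(A[i], B[j], n / C);   /* each thread uses n/C */
    thread_answer[t] = output / ((double)lenA * lenB);
    return NULL;
}

double parallel_answer(void) {
    pthread_t tid[C];
    for (size_t t = 0; t < C; t++)
        pthread_create(&tid[t], NULL, run_slice, (void *)(uintptr_t)t);
    double answer = 0.0;
    for (size_t t = 0; t < C; t++) {
        pthread_join(tid[t], NULL);
        answer += thread_answer[t];
    }
    return answer / C;   /* average of the C per-thread answers */
}
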
Licensed under: CC-BY-SA with attribution