문제

I have created a median-of-3 standard Quick-sort implementation which sorts a large array of random integers. I would like to go up to at least a 100 million elements but preferably a billion. To increase speed, I am attempting to parallelize the algorithm in Cilk++. The algorithm uses double recursion, and I spawn Cilk tasks to perform each recursive sort.

My algorithm works for arrays up to size 10 000 000. Without the Cilk keywords, my sequential algorithm handles 100 million elements easily but when I try to use Cilk the program crashes to desktop. I would now like to find out the reason for this. Am I generating too many Cilk tasks too quickly?

I am using Windows 7 64bit, Intel++ compiler and Intel Parallel Studio XE 2013 integrated in Visual Studio 2010. The program is compiled as a 32-bit application. The memory where the random data is stored is allocated as a single call to malloc, and the pointer is checked. In median calculation integer overflow is also guarded against when calculating the middle element.

This is mostly likely a Buffer Overrun issue.

This is my partition:

int pivotvalue = medianOf3(data, low, high);
// pivot is placed next to the last element

int left = low;
int right = high - 1;
while (left < right) {
    while (data[left] < pivotvalue) {
        left++;
        if (left > high) {
            break;
        }
    }
    while (data[right] >= pivotvalue) {
        right--;
        if (right < low) {
            break;
        }
    }

    if (left < right) {
        swap(data, left, right);
    }
}

// restore pivot
swap(data, left, high - 1);
return left;
도움이 되었습니까?

해결책

I don't know how Cilk works, but I recall needing to fix a quicksort implementation on an embedded platform that could overflow the stack with recursion. The fix was to use a recursive call for the smaller "half" of the data and process the larger "half" inside the function (i.e., tail recursion). Sorted (or reverse sorted) lists would always trigger overflow since the depth of the call graph was equal to the number of elements in the list.

Can you use the debugger to find out what causes the crash? Can you dump data to a log file on each entry into swap(), and then see what the function calls before crash looked like? Is it possible to dump the size of the stack on each call? Does each Cilk task have its own stack that is possibly smaller than the stack used in the non-Cilk version?

You could track stack usage by adding a parameter to swap() with a stack address. The first call and any new Cilk thread uses NULL. Each recursive call uses that parameter if it wasn't NULL, or the address of one of its local variables to establish where its stack is. Dump the difference in addresses, or track it in a global and only report it when it exceeds the previous maximum.

다른 팁

Intel Cilk Plus (cilk++ is a different product that's not supported) has a limit on the spawn depth of 1024. It's entirely possible that you're overflowing your deque, which would cause a crash.

Deciding how deep in your recursive tree to spawn is a balancing act. You want to have enough spawns to allow work to be stolen, but using too many adds overhead. cilk_spawn is cheap, but it's not free. It's probably best to add a check to your quicksort and not spawn your recursive call if the number of elements to be sorted is below some threshold.

One of the benefits of cilk_for is that it can optimize the spawning depth based on the number of workers you're running with.

- Barry Tannenbaum
  Intel Cilk Plus Runtime Development

A quicksort on N items that recurses for both subproblems has (worst-case) recursion depth O(N). The usual fix is the one suggested by "tomlogic": recurse on the smaller subproblem, iterate on the bigger subproblem. That reduces the recursion depth to O(lg N).

The fix carries over to the parallel version. If the serial program takes stack space S, the Cilk Plus version will take at most stack space PS. (This property is not true of many other parallel frameworks.) So spawning the smaller subproblem and iterating on the bigger subproblem should keep the total stack space within O(P lg N).

I'm one of the authors of a book Structured Parallel Programming, which discusses the parallel implementation of Quicksort in Cilk Plus and TBB. The examples (both fully recursive and semi-recursive) of Quicksort can be downloaded for free from http://parallelbook.com/student.

Arch D. Robison
Intel Corporation
라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top