Question

Cormen talks briefly about the advantages of picking a random pivot in quicksort. However, as pointed out here (fourth-to-last paragraph):

Using a random number generator to choose the positions is relatively expensive

So how is picking a random pivot actually implemented in practice, and how random is it? It can't be too expensive, since from what I understand one of quicksort's main advantages over other $\cal{O}(n \lg n)$ sorts is its small constant factors, and spending a lot of cycles on picking pivots would undermine that advantage.

EDIT

As an example, the C code from "Three Beautiful Quicksorts" actually calls the C library rand function:

#include <stdlib.h>   /* rand() */

/* x[] is the global array being sorted and swap(i, j) exchanges
   x[i] and x[j], as elsewhere in Bentley's code. */

int randint(int l, int u) {
    return rand() % (u - l + 1) + l;
}

void quicksort(int l, int u) {
    int i, m;
    if (l >= u) return;
    swap(l, randint(l, u));   /* move a random element into the pivot position */
    m = l;
    for (i = l+1; i <= u; i++)
        if (x[i] < x[l])
            swap(++m, i);
    swap(l, m);               /* put the pivot into its final place */
    quicksort(l, m-1);
    quicksort(m+1, u);
}

While the pivot-picking code here is clearly $\cal{O}(1)$, it would seem that the hidden constant $c$ is relatively high.

Solution

Take a look at McIlroy's "A Killer Adversary for Quicksort" (Software -- Practice and Experience 29(4): 341-344, 1999) or Crosby and Wallach's "Denial of Service via Algorithmic Complexity Attacks" for the reason behind randomizing. Quicksort behaves very badly if you pick (nearly) the largest or smallest element as the pivot each time, and for randomization to make sense, it has to be genuinely random: if I know how you pick pivots deterministically, I can cook up an array that forces you into $O(n^2)$ behaviour (see the first paper for details). But a fancy RNG is costlier than, say, taking the median of 3 (or some other smallish odd number of) elements as the pivot, which is another way to counteract bad behaviour. So if you choose randomization, use a fast RNG seeded appropriately; a linear congruential scheme will probably be enough.
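
This is not code from either paper, just a sketch of what "a fast RNG" can look like in C: a linear congruential generator costs one multiply and one add per pivot. The multiplier and increment are the widely used Numerical Recipes constants, and the helper names are my own.

#include <stdint.h>

/* Minimal linear congruential generator: one multiply and one add per call.
   The constants are the common "Numerical Recipes" LCG parameters. */
static uint32_t lcg_state = 123456789u;   /* seed this appropriately */

static uint32_t lcg_next(void) {
    lcg_state = 1664525u * lcg_state + 1013904223u;
    return lcg_state;
}

/* Pick a pivot index in [l, u]; the slight modulo bias is harmless here. */
static int random_pivot_index(int l, int u) {
    return l + (int)(lcg_next() % (uint32_t)(u - l + 1));
}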

Algorithms can (and should) be compared using $O(\cdot)$, but comparing different implementations of the same algorithm requires a more detailed analysis. In fact, Sedgewick and Flajolet argue in "An Introduction to the Analysis of Algorithms" that one shouldn't use $T(n) = O(f(n))$ at all, but should strive for asymptotics of the form $T(n) \sim f(n)$.

OTHER TIPS

So after taking the time to sit down and actually watch Bentley's Google lecture, "Three Beautiful Quicksorts," it turns out that randomized pivots are not faster than other methods. Specifically, according to Bentley, who together with McIlroy rewrote the standard C library qsort function, we have the following from their paper, "Engineering a Sort Function":

  • $1.386\, n \lg n$ average comparisons using the first, the middle, or a randomized pivot
  • $1.188\, n \lg n$ average comparisons using a median-of-3 pivot
  • $1.094\, n \lg n$ average comparisons using a median-of-3-medians (pseudo-median of 9) pivot

According to the above paper:

Our final code therefore chooses the middle element of smaller arrays, the median of the first, middle and last elements of a mid-sized array, and the pseudo-median of nine evenly spaced elements of a large array.
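
For concreteness, here is a rough C sketch of that size-adaptive choice. It is not the paper's code: med3 and choose_pivot are illustrative helpers of mine, and the thresholds 7 and 40 are the values commonly quoted from Bentley and McIlroy's qsort, used here as placeholders.

/* Index of the median of x[a], x[b], x[c]. */
static int med3(const int x[], int a, int b, int c) {
    return x[a] < x[b]
        ? (x[b] < x[c] ? b : (x[a] < x[c] ? c : a))
        : (x[b] > x[c] ? b : (x[a] < x[c] ? a : c));
}

/* Choose a pivot index for the subarray x[l..u]. */
static int choose_pivot(const int x[], int l, int u) {
    int n = u - l + 1;
    int m = l + n / 2;                 /* small array: the middle element */
    if (n > 7) {
        if (n > 40) {                  /* large array: pseudo-median of 9 */
            int s = n / 8;
            l = med3(x, l, l + s, l + 2 * s);
            m = med3(x, m - s, m, m + s);
            u = med3(x, u - 2 * s, u - s, u);
        }
        m = med3(x, l, m, u);          /* mid-sized array: median of 3 */
    }
    return m;
}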

I read the following in Data Structures Using C, by Tenenbaum, Langsam and Augenstein:

It is possible to speed up quicksort for sorted files by choosing a random element of each subfile as the pivot value. If a file is known to be nearly sorted, this might be a good strategy (although, in that case, choosing the middle element as a pivot would be even better). However, if nothing is known about the file, such a strategy does not improve the worst-case behaviour, since it is possible (although improbable) that the random element chosen each time might consistently be the smallest element of each subfile. As a practical matter, sorted files are more common than a good random number generator happening to choose the smallest element repeatedly.

In their book they use the Hoare partition scheme. Further down they say:

It can be shown, however, that on the average (over all files of size $n$), the quicksort makes approximately $1.386 n \lg n$ comparisons even in its unmodified version.

They are referring here to picking the first element as the pivot.

EDIT

I think I just stumbled on the answer to how a deterministic pivot selection can outperform a randomized one.

Exercise 9.3-3 in Cormen asks: "Show how quicksort can be made to run in $O(n \lg n)$ time in the worst case."

The answer is to use deterministic selection to pick the median element each time. Since deterministic selection runs in $O(n)$ worst-case time, you get the worst-case recurrence $$T(n)=2T(n/2)+\Theta(n),$$ which solves to $\Theta(n \lg n)$ for this deterministic quicksort, admittedly with a large hidden constant factor.
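
That the recurrence solves to $\Theta(n \lg n)$ is the standard textbook argument (case 2 of the master theorem); unrolling it directly, each of the $\lg n$ levels of the recursion does $\Theta(n)$ work:

$$T(n) = \sum_{i=0}^{\lg n} 2^i \cdot \Theta\!\left(\frac{n}{2^i}\right) = \sum_{i=0}^{\lg n} \Theta(n) = \Theta(n \lg n).$$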

I think that what Bentley is doing is an extension of this idea, using a pseudo-median. Bentley's paper references another paper that quantifies the quality of these pseudo-medians.

The cost of computing the pseudo-median is $\Theta(1)$, whereas finding the true median is $O(n)$. This is probably a big part of where his very small constant factor comes from. I assume that the paper Bentley references somehow proves that the pseudo-median is "good enough" to guarantee a high enough percentage of good splits by the partition function to give the above running times.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange