Question

In a standard algorithms course we are taught that quicksort is $O(n \log n)$ on average and $O(n^2)$ in the worst case. At the same time, we study other sorting algorithms which are $O(n \log n)$ in the worst case (like mergesort and heapsort), and even linear time in the best case (like bubblesort), albeit with some additional memory requirements.

After a quick glance at some more running times it is natural to say that quicksort should not be as efficient as others.

Also, consider that students learn in basic programming courses that recursion is not really good in general because it could use too much memory, etc. Therefore (and even though this is not a real argument), this gives the idea that quicksort might not be really good because it is a recursive algorithm.

Why, then, does quicksort outperform other sorting algorithms in practice? Does it have to do with the structure of real-world data? Does it have to do with the way memory works in computers? I know that some memories are way faster than others, but I don't know if that's the real reason for this counter-intuitive performance (when compared to theoretical estimates).


Update 1: A canonical answer says that the constants involved in the $O(n\log n)$ of the average case are smaller than the constants involved in other $O(n\log n)$ algorithms. However, I have yet to see a proper justification of this, with precise calculations instead of intuitive ideas only.

In any case, it seems like the real difference occurs, as some answers suggest, at the memory level, where implementations take advantage of the internal structure of computers, using, for example, the fact that cache memory is faster than RAM. The discussion is already interesting, but I'd still like to see more detail with respect to memory management, since it appears that the answer has to do with it.


Update 2: There are several web pages offering a comparison of sorting algorithms, some fancier than others (most notably sorting-algorithms.com). Other than presenting a nice visual aid, this approach does not answer my question.


Solution

Short Answer

The cache-efficiency argument has already been explained in detail. In addition, there is an intrinsic reason why Quicksort is fast: if it is implemented with two “crossing pointers”, e.g. here, the inner loops have a very small body. As this is the code executed most often, this pays off.
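
For concreteness, here is a minimal Python sketch of such a crossing-pointer (Hoare-style) partition; the function names and the middle-element pivot choice are illustrative, not taken from the answer above. Note that each inner loop does only one comparison and one pointer move per iteration:

    def partition(a, lo, hi):
        """Hoare-style 'crossing pointers' partition of a[lo..hi] (inclusive)."""
        pivot = a[(lo + hi) // 2]       # illustrative pivot choice
        i, j = lo - 1, hi + 1
        while True:
            i += 1
            while a[i] < pivot:         # tiny inner loop: one compare, one increment
                i += 1
            j -= 1
            while a[j] > pivot:         # tiny inner loop: one compare, one decrement
                j -= 1
            if i >= j:
                return j                # a[lo..j] <= pivot <= a[j+1..hi]
            a[i], a[j] = a[j], a[i]

    def quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo < hi:
            p = partition(a, lo, hi)
            quicksort(a, lo, p)
            quicksort(a, p + 1, hi)

    data = [3, 1, 4, 1, 5, 9, 2, 6]
    quicksort(data)
    print(data)                         # [1, 1, 2, 3, 4, 5, 6, 9]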

Long Answer

First of all,

The Average Case does not exist!

Since best and worst cases are often extremes that rarely occur in practice, average-case analysis is done instead. But any average-case analysis assumes some distribution of inputs! For sorting, the typical choice is the random permutation model (tacitly assumed on Wikipedia).

Why $O$-Notation?

Discarding constants in analysis of algorithms is done for one main reason: If I am interested in exact running times, I need (relative) costs of all involved basic operations (even still ignoring caching issues, pipelining in modern processors ...). Mathematical analysis can count how often each instruction is executed, but running times of single instructions depend on processor details, e.g. whether a 32-bit integer multiplication takes as much time as addition.

There are two ways out:

  1. Fix some machine model.

    This is done in Don Knuth's book series “The Art of Computer Programming” for an artificial “typical” computer invented by the author. In volume 3 you find exact average case results for many sorting algorithms, e.g.

    • Quicksort: $ 11.667(n+1)\ln(n)-1.74n-18.74 $
    • Mergesort: $ 12.5 n \ln(n) $
    • Heapsort: $ 16 n \ln(n) +0.01n $
    • Insertionsort: $2.25n^2+7.75n-3\ln(n)$

    [Figure: runtimes of several sorting algorithms]

    These results indicate that Quicksort is fastest (a quick numeric check of these formulas follows after this list). But it is only proved on Knuth's artificial machine; it does not necessarily imply anything for, say, your x86 PC. Note also that the algorithms relate differently for small inputs:
    [Figure: runtimes of several sorting algorithms for small inputs]

  2. Analyse abstract basic operations.

    For comparison based sorting, this typically is swaps and key comparisons. In Robert Sedgewick's books, e.g. “Algorithms”, this approach is pursued. You find there

    • Quicksort: $2n\ln(n)$ comparisons and $\frac13n\ln(n)$ swaps on average
    • Mergesort: $1.44n\ln(n)$ comparisons, but up to $8.66n\ln(n)$ array accesses (mergesort is not swap based, so we cannot count that).
    • Insertionsort: $\frac14n^2$ comparisons and $\frac14n^2$ swaps on average.

    As you can see, this does not allow comparing algorithms as readily as the exact runtime analysis does, but the results are independent of machine details; a small experiment that counts these operations follows after this list.
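
For the machine-model results in point 1, one can simply evaluate the quoted formulas for a few input sizes. This is just a numeric check of the expressions quoted above, nothing more:

    import math

    # Average-case costs quoted above (Knuth, TAOCP Vol. 3)
    knuth = {
        "Quicksort":     lambda n: 11.667 * (n + 1) * math.log(n) - 1.74 * n - 18.74,
        "Mergesort":     lambda n: 12.5 * n * math.log(n),
        "Heapsort":      lambda n: 16 * n * math.log(n) + 0.01 * n,
        "Insertionsort": lambda n: 2.25 * n**2 + 7.75 * n - 3 * math.log(n),
    }

    for n in (100, 10_000, 1_000_000):
        print(f"n = {n}")
        for name, cost in knuth.items():
            print(f"  {name:13s} {cost(n):>22,.0f}")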
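
For the abstract counts in point 2, a rough experimental sketch (my own, with illustrative function names) counts key comparisons on a random permutation; the counts land in the same ballpark as the $2n\ln(n)$ and $1.44n\ln(n)$ figures quoted above:

    import math, random

    def quicksort_comparisons(xs):
        """Comparisons made by a simple randomized, list-based quicksort."""
        count = 0
        def sort(ys):
            nonlocal count
            if len(ys) <= 1:
                return ys
            pivot = ys[random.randrange(len(ys))]
            less, equal, greater = [], [], []
            for y in ys:
                count += 1                 # one key comparison against the pivot
                if y < pivot:
                    less.append(y)
                elif y > pivot:
                    greater.append(y)
                else:
                    equal.append(y)
            return sort(less) + equal + sort(greater)
        sort(xs)
        return count

    def mergesort_comparisons(xs):
        """Comparisons made by a plain top-down mergesort."""
        count = 0
        def sort(ys):
            nonlocal count
            if len(ys) <= 1:
                return ys
            mid = len(ys) // 2
            left, right = sort(ys[:mid]), sort(ys[mid:])
            out, i, j = [], 0, 0
            while i < len(left) and j < len(right):
                count += 1
                if left[i] <= right[j]:
                    out.append(left[i]); i += 1
                else:
                    out.append(right[j]); j += 1
            out.extend(left[i:]); out.extend(right[j:])
            return out
        sort(xs)
        return count

    n = 20_000
    data = random.sample(range(n), n)      # a random permutation
    print("quicksort :", quicksort_comparisons(data), " ~2 n ln n    =", round(2 * n * math.log(n)))
    print("mergesort :", mergesort_comparisons(data), " ~1.44 n ln n =", round(1.44 * n * math.log(n)))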

Other input distributions

As noted above, average cases are always with respect to some input distribution, so one might consider distributions other than random permutations. E.g. research has been done on Quicksort with equal elements, and there is a nice article on the standard sort function in Java.

Other tips

There are multiple points that can be made regarding this question.

Quicksort is usually fast

Although Quicksort has worst-case $O(n^2)$ behaviour, it is usually fast: assuming random pivot selection, there's a very large chance we pick some number that separates the input into two similarly sized subsets, which is exactly what we want to have.

In particular, even if we pick a pivot that creates a 10%-90% split every 10 splits (which is a meh split), and a 1 element - $n-1$ element split otherwise (which is the worst split you can get), our running time is still $O(n \log n)$ (note that this would blow up the constants to a point that Merge sort is probably faster though).
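
To see why even a fixed 10%-90% split still gives $O(n \log n)$, here is the standard recursion-tree argument (a sketch of the reasoning, with $c$ denoting the per-element partitioning cost): the recurrence is $T(n) \le T(n/10) + T(9n/10) + cn$. Each level of the recursion tree does at most $cn$ work in total, and the tree has depth at most $\log_{10/9} n$, so $T(n) \le c\,n\log_{10/9} n = \frac{c}{\ln(10/9)}\, n\ln n \approx 9.5\, c\, n\ln n$, which is $O(n \log n)$. For perfectly balanced splits the same argument gives about $\frac{c}{\ln 2}\, n \ln n \approx 1.44\, c\, n\ln n$, which is where the blown-up constants mentioned above come from.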

Quicksort is usually faster than most sorts

Quicksort is usually faster than sorts that are slower than $O(n \log n)$ (say, Insertion sort with its $O(n^2)$ running time), simply because for large $n$ their running times explode.

A good reason why Quicksort is so fast in practice compared to most other $O(n \log n)$ algorithms such as Heapsort is that it is relatively cache-efficient: in the external-memory model it incurs only $O(\frac{n}{B} \log (\frac{n}{B}))$ block transfers, where $B$ is the block size. Heapsort, on the other hand, doesn't get any such speedup: it does not access memory cache-efficiently at all.

The reason for this cache efficiency is that it linearly scans the input and linearly partitions the input. This means we can make the most of every cache load we do as we read every number we load into the cache before swapping that cache for another. In particular, the algorithm is cache-oblivious, which gives good cache performance for every cache level, which is another win.

Cache efficiency could be further improved to $O(\frac{n}{B} \log_{\frac{M}{B}} (\frac{n}{B}))$, where $M$ is the size of our main memory, if we use $k$-way Quicksort. Note that Mergesort has the same cache-efficiency as Quicksort, and its k-way version in fact has better performance (through lower constant factors) if memory is a severe constraint. This gives rise to the next point: we'll need to compare Quicksort to Mergesort on other factors.

Quicksort is usually faster than Mergesort

This comparison is completely about constant factors (if we consider the typical case). In particular, the choice is between a suboptimal choice of the pivot for Quicksort versus the copy of the entire input for Mergesort (or the complexity of the algorithm needed to avoid this copying). It turns out that the former is more efficient: there's no theory behind this, it just happens to be faster.

Note that Quicksort will make more recursive calls, but allocating stack space is cheap (almost free in fact, as long as you don't blow the stack) and you re-use it. Allocating a giant block on the heap (or your hard drive, if $n$ is really large) is quite a bit more expensive, but both are $O(\log n)$ overheads that pale in comparison to the $O(n)$ work mentioned above.

Lastly, note that Quicksort is slightly sensitive to input that happens to be in the right order, in which case it can skip some swaps. Mergesort doesn't have any such optimizations, which also makes Quicksort a bit faster compared to Mergesort.

Use the sort that suits your needs

In conclusion: no sorting algorithm is always optimal. Choose whichever one suits your needs. If you need an algorithm that is the quickest for most cases, and you don't mind it might end up being a bit slow in rare cases, and you don't need a stable sort, use Quicksort. Otherwise, use the algorithm that suits your needs better.

In one of the programming tutorials at my university, we asked students to compare the performance of quicksort, mergesort and insertion sort against Python's built-in list.sort (which uses Timsort). The experimental results surprised me deeply, since the built-in list.sort performed so much better than the other sorting algorithms, even on instances that easily made quicksort and mergesort crash. So it's premature to conclude that the usual quicksort implementation is the best in practice. But I'm sure there are much better implementations of quicksort, or some hybrid versions of it, out there.

This is a nice blog article by David R. MacIver explaining Timsort as a form of adaptive mergesort.

I think one of the main reasons why QuickSort is so fast compared with other sorting algorithms is that it's cache-friendly. When QS processes a segment of an array, it accesses elements at the beginning and end of the segment, and moves towards the center of the segment.

So, when you start, you access the first element in the array and a piece of memory (“location”) is loaded into the cache. And when you try to access the second element, it's (most likely) already in the cache, so it's very fast.

Other algorithms like heapsort don't work like this, they jump in the array a lot, which makes them slower.

Others have already said that the asymptotic average runtime of Quicksort is better (in the constant) than that of other sorting algorithms (in certain settings).

What does that mean? Assume any permutation is chosen at random (assuming uniform distribution). In this case, typical pivot selection methods provide pivots that in expectation divide the list/array roughly in half; that is what brings us down to $\cal{O}(n \log n)$. But, additionally, merging partial solutions obtained by recursing takes only constant time (as opposed to linear time in the case of Mergesort). Of course, separating the input into two lists according to the pivot takes linear time, but it often requires few actual swaps.

Note that there are many variants of Quicksort (see e.g. Sedgewick's dissertation). They perform differently on different input distributions (uniform, almost sorted, almost inversely sorted, many duplicates, ...), and other algorithms might be better for some.

Another fact worth noting is that Quicksort is slow on short inputs compared to simpler algorithms with less overhead. Therefore, good libraries do not recurse down to lists of length one but will use (for example) insertion sort if the input length is smaller than some $k \approx 10$.
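
As an illustration, here is a sketch of such a hybrid (the cutoff of 10 and the middle-element Hoare partition are arbitrary illustrative choices, not tuned values):

    CUTOFF = 10   # illustrative threshold; real libraries tune this empirically

    def insertion_sort(a, lo, hi):
        """Sort the small slice a[lo..hi] (inclusive) in place."""
        for i in range(lo + 1, hi + 1):
            x, j = a[i], i - 1
            while j >= lo and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x

    def hybrid_quicksort(a, lo=0, hi=None):
        """Quicksort that hands short subarrays to insertion sort."""
        if hi is None:
            hi = len(a) - 1
        if hi - lo + 1 <= CUTOFF:
            insertion_sort(a, lo, hi)
            return
        pivot = a[(lo + hi) // 2]        # Hoare partition, middle-element pivot
        i, j = lo - 1, hi + 1
        while True:
            i += 1
            while a[i] < pivot:
                i += 1
            j -= 1
            while a[j] > pivot:
                j -= 1
            if i >= j:
                break
            a[i], a[j] = a[j], a[i]
        hybrid_quicksort(a, lo, j)
        hybrid_quicksort(a, j + 1, hi)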

In comparison to other comparison-based sorting algorithms with $O(n \lg n)$ time complexity, quick-sort is often considered to be better than other algorithms like merge-sort because it is an in-place sorting algorithm. In other words, we don't need (much more) memory to store the members of the array.

ps: to be precise, being better than other algorithms is task dependent. For some tasks it might be better to use other sorting algorithms.


Even though quick-sort has a worst case run time of $\Theta(n^2)$, quicksort is considered the best sorting algorithm because it is VERY efficient on average: its expected running time is $\Theta(n\log n)$, where the constants are VERY SMALL compared to other sorting algorithms. This is the main reason for using quick sort over other sorting algorithms.

The second reason is that it performs in-place sorting and works very well with virtual-memory environments.

UPDATE (after Janoma's and Svick's comments):

To illustrate this better, let me give an example using Merge Sort (because Merge Sort is, I think, the next most widely adopted sorting algorithm after Quick Sort) and point out where the extra constants come from (to the best of my knowledge, and why I think Quick Sort is better):

Consider the following sequence:

12,30,21,8,6,9,1,7. The merge sort algorithm works as follows:

(a) 12,30,21,8    6,9,1,7  //divide stage
(b) 12,30   21,8   6,9   1,7   //divide stage
(c) 12   30   21   8   6   9   1   7   //Final divide stage
(d) 12,30   8,21   6,9   1,7   //Merge Stage
(e) 8,12,21,30   .....     // Analyze this stage

If you look carefully at how the last stage happens: first 12 is compared with 8, and since 8 is smaller it goes first. Then 12 is AGAIN compared with 21, 12 goes next, and so on and so forth. If you take the final merge, i.e. 4 elements with 4 other elements, it incurs a lot of EXTRA comparisons as constants which are NOT incurred in Quick Sort. This is why quick sort is preferred.
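
A small instrumented merge step (a sketch; the counter is there only for illustration) makes those extra comparisons visible:

    def merge(left, right):
        """Merge two sorted lists; return (merged, number of key comparisons)."""
        out, i, j, comparisons = [], 0, 0, 0
        while i < len(left) and j < len(right):
            comparisons += 1              # one comparison per element placed, until one side runs out
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        out.extend(left[i:])
        out.extend(right[j:])
        return out, comparisons

    print(merge([12, 30], [8, 21]))              # ([8, 12, 21, 30], 3)
    print(merge([8, 12, 21, 30], [1, 6, 7, 9]))  # ([1, 6, 7, 8, 9, 12, 21, 30], 5)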

My experience working with real world data is that quicksort is a poor choice. Quicksort works well with random data, but real world data is most often not random.

Back in 2008 I tracked a hanging software bug down to the use of quicksort. A while later I wrote simple implementations of insertion sort, quicksort, heap sort and merge sort and tested these. My merge sort outperformed all the others while working on large data sets.

Since then, merge sort is my sorting algorithm of choice. It is elegant. It is simple to implement. It is a stable sort. It does not degenerate to quadratic behaviour like quicksort does. I switch to insertion sort to sort small arrays.

On many occasions I have found myself thinking that a given implementation works surprisingly well for quicksort, only to find out that it actually isn't quicksort. Sometimes the implementation switches between quicksort and another algorithm, and sometimes it does not use quicksort at all. As an example, GLibc's qsort() function actually uses merge sort. Only if allocating the working space fails does it fall back to in-place quicksort, which a code comment calls "the slower algorithm".

Edit: Programming languages such as Java, Python and Perl also use merge sort, or more precisely a derivative, such as Timsort or merge sort for large sets and insertion sort for small sets. (Java also uses dual-pivot quicksort which is faster than plain quicksort.)

1 - Quick sort is in-place (it doesn't need extra memory, other than a constant amount).

2 - Quick sort is easier to implement than other efficient sorting algorithms.

3 - Quick sort has smaller constant factors in its running time than other efficient sorting algorithms.

Update: For merge sort, you need to do some "merging," which needs extra array(s) to store the data before merging; but in quick sort, you don't. That's why quick sort is in-place. There are also some extra comparisons made for merging which increase constant factors in merge sort.

Under what conditions is a specific sorting algorithm actually the fastest one?

1) When implemented in a parallel way in hardware, does it need to have a reasonably low latency while requiring as few gates as possible? Yes -> use a bitonic sorter or Batcher odd-even mergesort, latency is $\Theta(\log(n)^2)$ and the number of comparators and multiplexers is $\Theta(n \cdot \log(n)^2)$.

2) How many different values can each element have? Can every possible value be assigned a unique place in memory or cache? Yes -> use count sort or radix sort; those usually have a linear runtime of $\Theta(n \cdot k)$ (count sort) or $\Theta(n \cdot m)$ (radix sort), but they slow down for a large number of different values, as $k=2^{\#number\_of\_possible\_values}$ and $m = \#maximum\_length\_of\_keys$. (A minimal counting-sort sketch follows this decision list.)

3) Does the underlying data structure consist of linked elements? Yes -> always use in-place merge sort. There are both easy-to-implement fixed-size and adaptive (aka natural) bottom-up in-place merge sorts of different arities for linked data structures, and since they never require copying the entire data in each step and they never require recursion either, they are faster than any other general comparison-based sorts, even faster than quick sort.

4) Does the sorting need to be stable? Yes -> use merge sort, either in place or not, fixed-size or adaptive, depending on the underlying data structure and the kind of data to be expected, even in cases where quick sort would otherwise be preferred. Stabilizing an arbitrary sorting algorithm requires $\Theta(n)$ additional memory in the worst case (to hold the original indexes, which also need to be kept in sync with every swap performed on the input data), so any performance gain that quick sort might have over merge sort is probably thwarted.

5) Can the size of the underlying data be bounded to a small to medium size? E.g. is n < 10,000...100,000,000 (depending on the underlying architecture and data structure)? Yes -> use bitonic sort or Batcher odd-even mergesort. Goto 1)

6) Can you spare another $\Theta(n)$ memory? Yes ->

   a) Does the input data consist of large pieces of already sorted sequential data? Yes -> use adaptive (aka natural) merge sort or Timsort.

   b) Does the input data mostly consist of elements that are almost in the correct place? Yes -> use bubble sort or insertion sort. If you fear their $\Theta(n^2)$ time complexity (which is pathological for almost sorted data), maybe consider switching to shell sort with an (almost) asymptotically optimal sequence of gaps; some sequences that yield a $\Theta(n \cdot \log(n)^2)$ worst-case run time are known. Or maybe try comb sort. I'm not sure either shell sort or comb sort would perform reasonably well in practice.

No -> 7) Can you spare another $\Theta(\log(n))$ memory? Yes ->

   a) Does the underlying data structure allow for directed sequential access or better? Yes -> Does it allow only a single sequence of read/write accesses at a time, until the end of the data has been reached (e.g. directed tape access)?

      Yes -> i) use merge sort, but there is no obvious way to make this case in place, so it may require additional $\Theta(n)$ memory. But if you have the time and the balls to do it, there is a way to merge 2 arrays in $\Theta(n)$ time using only $\Theta(\log(n))$ space in a stable way: according to Donald E. Knuth, "The Art of Computer Programming, Volume 3: Sorting and Searching", exercise 5.5.3, there is an algorithm by L. Trabb-Pardo that does so. However, I doubt this would be any faster than the naive mergesort version or the quicksort from the case above.

      No, it allows multiple simultaneous accesses to a sequence of data (e.g. it is not a tape drive) -> ii) use quicksort; for practical purposes I would recommend either a randomized or an approximated-median one. If you are wary of pathological $\Theta(n^2)$ cases, consider using introsort. If you are hell-bent on deterministic behavior, consider using the median-of-medians algorithm to select the pivot element; it requires $\Theta(n)$ time and its naive implementation requires $\Theta(n)$ space (parallelizable), whereas it may be implemented to require only $\Theta(\log(n))$ space (not parallelizable). However, the median-of-medians algorithm gives you a deterministic quicksort with a worst-case $\Theta(n \cdot \log(n))$ run time.

   No -> you're screwed (sorry, we need at least one way of accessing each data element once).

No -> 8) Can you spare a small constant amount of memory? Yes -> Does the underlying data structure allow for random access?

   Yes -> use heapsort; it has an asymptotically optimal run time of $\Theta(n \cdot \log(n))$, but dismal cache coherency, and it doesn't parallelize well.

   No -> you are screwed.

No -> you are screwed.
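
For point 2, here is a minimal counting-sort sketch (the function name and the key_range parameter are my own, purely illustrative); it only pays off while the range of possible key values stays small:

    def counting_sort(a, key_range):
        """Sort non-negative integer keys from 0..key_range-1 in Θ(n + key_range)."""
        counts = [0] * key_range          # one slot per possible value
        for x in a:
            counts[x] += 1
        out = []
        for value, c in enumerate(counts):
            out.extend([value] * c)       # emit each value as often as it occurred
        return out

    print(counting_sort([4, 1, 3, 1, 0, 4, 2], key_range=5))   # [0, 1, 1, 2, 3, 4, 4]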

Implementation hints for quicksort:

1) Naive binary quicksort requires $\Theta(n)$ additional memory in the worst case; however, it is relatively easy to reduce that down to $\Theta(\log(n))$ by always recursing into the smaller partition first and rewriting the recursive call on the larger one into a loop (a sketch of this follows after these hints). Doing the same for k-ary quicksorts for k > 2 requires $\Theta(n^{\log_k(k-1)})$ space (according to the master theorem), so binary quicksort requires the least amount of memory. I would be delighted to hear if anyone knows whether k-ary quicksort for k > 2 might be faster than binary quicksort in some real-world setup.

2) There exist bottom-up, iterative variants of quicksort, but AFAIK, they have the same asymptotic space and time bounds as the top-down ones, with the additional downside of being difficult to implement (e.g. explicitly managing a queue). My experience is that for any practical purposes, those are never worth considering.
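
Here is a sketch of the stack-bounding trick from point 1 (illustrative names; any correct partition routine works, a Lomuto-style one is used here just to keep it short):

    def quicksort_logn_stack(a, lo=0, hi=None):
        """Quicksort whose call stack stays O(log n): recurse into the smaller
        partition, then keep looping on the larger one."""
        if hi is None:
            hi = len(a) - 1
        while lo < hi:
            pivot, i = a[hi], lo          # Lomuto partition around the last element
            for j in range(lo, hi):
                if a[j] < pivot:
                    a[i], a[j] = a[j], a[i]
                    i += 1
            a[i], a[hi] = a[hi], a[i]     # pivot ends up at index i
            if i - lo < hi - i:
                quicksort_logn_stack(a, lo, i - 1)   # recurse into the smaller side
                lo = i + 1                           # ...and loop over the larger side
            else:
                quicksort_logn_stack(a, i + 1, hi)
                hi = i - 1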

Implementation hints for mergesort:

1) bottom-up mergesort is always faster than top-down mergesort, as it requires no recursive calls.

2) the very naive mergesort may be sped up by using a double buffer and switching buffers instead of copying the data back from the temporary array after each step.

3) For many real-world data, adaptive mergesort is much faster than a fixed-size mergesort.

4) the merge algorithm can easily be parallelized by splitting the input data into k approximately same-sized parts. This will require k references into the data, and it is a good thing to choose k such that all of k (or c*k for a small constant c >= 1) fit into the nearest memory hierarchy (usually the L1 data cache). Choosing the smallest out of k elements the naive way (linear search) takes $\Theta(k)$ time, whereas building up a min-heap within those k elements and choosing the smallest one requires only amortized $\Theta(\log(k))$ time (picking the minimum is $\Theta(1)$ of course, but we need to do a little maintenance as one element is removed and replaced by another one in each step). The parallelized merge always requires $\Theta(n)$ memory regardless of k.
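
A sketch of that heap-based k-way merge (Python's heapq is used here just for illustration; the standard library even ships heapq.merge, which does the same job):

    import heapq

    def kway_merge(runs):
        """Merge k sorted lists ('runs') into one sorted list.

        The heap holds one candidate per run, so each of the n output elements
        costs O(log k) heap work instead of O(k) for a linear scan.
        """
        heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
        heapq.heapify(heap)
        out = []
        while heap:
            value, run_idx, pos = heapq.heappop(heap)
            out.append(value)
            nxt = pos + 1
            if nxt < len(runs[run_idx]):
                heapq.heappush(heap, (runs[run_idx][nxt], run_idx, nxt))
        return out

    print(kway_merge([[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]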

From what I have written, it is clear that quicksort often isn't the fastest algorithm, except when the following conditions all apply:

1) there are more than a "few" possible values

2) the underlying data structure is not linked

3) we do not need a stable order

4) data is big enough that the slightly sub-optimal asymptotic run time of a bitonic sorter or Batcher odd-even mergesort kicks in

5) the data isn't almost sorted and doesn't consist of bigger already sorted parts

6) we can access the data sequence simultaneously from multiple places

7) { memory writes are particularly expensive (because that is mergesort's main disadvantage), to the point that it slows the algorithm down more than quicksort's probable sub-optimal split would } or { we can only have $\Theta(\log(n))$ additional memory, and $\Theta(n)$ is too much (e.g. external storage) }


Most sorting methods have to move data around in short steps (for example, merge sort makes changes locally, then merges this small piece of data, then merges a bigger one...). In consequence, you need many data movements if data is far from its destination.

Quicksort, on the other hand, tries to exchange numbers that are in the first part of the array and are big with numbers that are in the second part of the array and are small (if you are sorting $a \le b$; the argument is the same in the other direction), so they quickly get placed near their final destination.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange