Question

I rewrote the radix sort algorithm from the Python example on Wikipedia, using SciPy arrays to gain performance and reduce code length, which I managed to accomplish. Then I took the classic (in-memory, pivot-based) quick sort algorithm from Literate Programming and compared their performance.

I expected radix sort to beat quick sort beyond a certain threshold, which it did not. Further, I found Erik Gorset's blog post asking the question "Is radix sort faster than quick sort for integer arrays?". There the answer is that

.. the benchmark shows the MSB in-place radix sort to be consistently over 3 times faster than quicksort for large arrays.

Unfortunately, I could not reproduce the result; the differences are that (a) Erik chose Java and not Python and (b) he uses the MSB in-place radix sort, whereas I just fill buckets inside a Python dictionary.
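
For reference, here is a rough sketch of what I understand by an MSB in-place radix sort (my own illustration in plain Python, not Erik's Java code); it recursively partitions the array on one bit at a time, a bit like quick sort with a fixed pivot:

def msb_radix_sort (ls, lo=0, hi=None, bit=30): ## assumes non-negative ints < 2**31

    if hi is None: hi = len (ls)
    if hi - lo < 2 or bit < 0: return ls

    mask = 1 << bit
    i, j = lo, hi - 1

    while i <= j: ## move values with the current bit set to the back, in place
        if ls[i] & mask:
            ls[i], ls[j] = ls[j], ls[i]
            j -= 1
        else:
            i += 1

    msb_radix_sort (ls, lo, i, bit - 1) ## recurse on the 0-bucket
    msb_radix_sort (ls, i, hi, bit - 1) ## recurse on the 1-bucket

    return ls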

According to theory, radix sort should be faster (linear time) than quick sort; but apparently it depends a lot on the implementation. So where is my mistake?

Here is the code comparing both algorithms:

from sys   import argv
from time  import clock

from pylab import array, vectorize
from pylab import absolute, log10, randint
from pylab import semilogy, grid, legend, title, show

###############################################################################
# radix sort
###############################################################################

def splitmerge0 (ls, digit): ## python (pure!)

    ## pair each value with its current decimal digit
    seq = map (lambda n: ((n // 10 ** digit) % 10, n), ls)
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}

    ## inner reduce distributes the (digit, value) pairs into the ten buckets,
    ## outer reduce concatenates the buckets (keys 0-9) back into a single list
    return reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), [])

def splitmergeX (ls, digit): ## python & numpy

    ## build (digit, value) rows with a vectorized numpy function
    seq = array (vectorize (lambda n: ((n // 10 ** digit) % 10, n)) (ls)).T
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}

    return array (reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), []))

def radixsort (ls, fn = splitmergeX):

    ## one splitmerge pass per decimal digit of the largest absolute value
    return reduce (fn, xrange (int (log10 (absolute (ls).max ()) + 1)), ls)

###############################################################################
# quick sort
###############################################################################

def partition (ls, start, end, pivot_index):

    lower = start
    upper = end - 1

    pivot = ls[pivot_index]
    ls[pivot_index] = ls[end]

    while True:

        while lower <= upper and ls[lower] <  pivot: lower += 1
        while lower <= upper and ls[upper] >= pivot: upper -= 1
        if lower > upper: break

        ls[lower], ls[upper] = ls[upper], ls[lower]

    ls[end] = ls[lower]
    ls[lower] = pivot

    return lower

def qsort_range (ls, start, end):

    if end - start + 1 < 32:
        insertion_sort(ls, start, end)
    else:
        pivot_index = partition (ls, start, end, randint (start, end))
        qsort_range (ls, start, pivot_index - 1)
        qsort_range (ls, pivot_index + 1, end)

    return ls

def insertion_sort (ls, start, end):

    for idx in xrange (start, end + 1):
        el = ls[idx]
        for jdx in reversed (xrange(0, idx)):
            if ls[jdx] <= el:
                ls[jdx + 1] = el
                break
            ls[jdx + 1] = ls[jdx]
        else:
            ls[0] = el

    return ls

def quicksort (ls):

    return qsort_range (ls, 0, len (ls) - 1)

###############################################################################
if __name__ == "__main__":
###############################################################################

    lower = int (argv [1]) ## requires: >= 2
    upper = int (argv [2]) ## requires: >= 2
    color = dict (enumerate (3*['r','g','b','c','m','k']))

    rslbl = "radix sort"
    qslbl = "quick sort"

    for value in xrange (lower, upper):

        #######################################################################

        ls = randint (1, value, size=value)

        t0 = clock ()
        rs = radixsort (ls)
        dt = clock () - t0

        print "%06d -- t0:%0.6e, dt:%0.6e" % (value, t0, dt)
        semilogy (value, dt, '%s.' % color[int (log10 (value))], label=rslbl)

        #######################################################################

        ls = randint (1, value, size=value)

        t0 = clock ()
        rs = quicksort (ls)
        dt = clock () - t0

        print "%06d -- t0:%0.6e, dt:%0.6e" % (value, t0, dt)
        semilogy (value, dt, '%sx' % color[int (log10 (value))], label=qslbl)

    grid ()
    legend ((rslbl,qslbl), numpoints=3, shadow=True, prop={'size':'small'})
    title ('radix & quick sort: #(integer) vs duration [s]')
    show ()

###############################################################################
###############################################################################

And here is the result, comparing sorting durations in seconds (logarithmic vertical axis) for integer arrays with sizes from 2 to 1250 (horizontal axis); the lower curve belongs to quick sort.

Quick sort is smooth at the powers of ten (e.g. at 10, 100 or 1000), while radix sort jumps a little there; otherwise it follows qualitatively the same curve as quick sort, only much slower!


Solution

You have several problems here.

First of all, as pointed out in the comments, your data set is far too small for the theoretical complexity to overcome the overheads in the code.
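
To put rough numbers on it (back-of-the-envelope figures, not a measurement): with n = 1250 values below 1250 you have at most 4 decimal digits, so the LSD radix sort performs about 4 * 1250 = 5000 bucket operations, while quicksort performs roughly n * log2(n) ≈ 1250 * 10.3 ≈ 12900 comparisons. A factor of about 2.5 in operation count is easily swallowed by the much larger per-operation cost of the dictionary-and-lambda version.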

Next, your implementation, with all those unnecessary function calls and copying of lists, is very inefficient. Writing the code in a straightforward procedural manner will almost always be faster than a functional solution (in Python, that is; other languages differ here). You have a procedural implementation of quicksort, so if you write your radix sort in the same style it may turn out faster even for small lists.

Finally, when you do try large lists, the overheads of memory management begin to dominate. So you have a limited window between small lists, where the efficiency of the implementation is the dominant factor, and large lists, where memory management is the dominant factor.

Here's some code that uses your quicksort but a simple radix sort written procedurally, trying to avoid so much copying of data. You'll see that even for short lists it beats the quicksort, and, more interestingly, as the data size goes up the ratio between quicksort and radix sort goes up too, before dropping again as memory management starts to dominate (simple things like freeing a list of 1,000,000 items take a significant time):

from random import randint
from math import log10
from time import clock
from itertools import chain

def splitmerge0 (ls, digit): ## python (pure!)

    seq = map (lambda n: ((n // 10 ** digit) % 10, n), ls)
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}

    return reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), [])

def splitmerge1 (ls, digit): ## python (readable!)
    buf = [[] for i in range(10)]
    divisor = 10 ** digit
    for n in ls:
        buf[(n//divisor)%10].append(n)
    return chain(*buf)

def radixsort (ls, fn = splitmerge1):
    passes = int (log10 (max (abs (val) for val in ls)) + 1)
    return list (reduce (fn, xrange (passes), ls))

###############################################################################
# quick sort
###############################################################################

def partition (ls, start, end, pivot_index):

    lower = start
    upper = end - 1

    pivot = ls[pivot_index]
    ls[pivot_index] = ls[end]

    while True:

        while lower <= upper and ls[lower] <  pivot: lower += 1
        while lower <= upper and ls[upper] >= pivot: upper -= 1
        if lower > upper: break

        ls[lower], ls[upper] = ls[upper], ls[lower]

    ls[end] = ls[lower]
    ls[lower] = pivot

    return lower

def qsort_range (ls, start, end):

    if end - start + 1 < 32:
        insertion_sort(ls, start, end)
    else:
        pivot_index = partition (ls, start, end, randint (start, end))
        qsort_range (ls, start, pivot_index - 1)
        qsort_range (ls, pivot_index + 1, end)

    return ls

def insertion_sort (ls, start, end):

    for idx in xrange (start, end + 1):
        el = ls[idx]
        for jdx in reversed (xrange(0, idx)):
            if ls[jdx] <= el:
                ls[jdx + 1] = el
                break
            ls[jdx + 1] = ls[jdx]
        else:
            ls[0] = el

    return ls

def quicksort (ls):

    return qsort_range (ls, 0, len (ls) - 1)

if __name__=='__main__':
    for value in 1000, 10000, 100000, 1000000, 10000000:
        ls = [randint (1, value) for _ in range(value)]
        ls2 = list(ls)
        last = -1
        start = clock()
        ls = radixsort(ls)
        end = clock()
        for i in ls:
            assert last <= i
            last = i
        print("rs %d: %0.2fs" % (value, end-start))
        tdiff = end-start
        start = clock()
        ls2 = quicksort(ls2)
        end = clock()
        last = -1
        for i in ls2:
            assert last <= i
            last = i
        print("qs %d: %0.2fs %0.2f%%" % (value, end-start, ((end-start)/tdiff*100)))

The output when I run this is:

C:\temp>c:\python27\python radixsort.py
rs 1000: 0.00s
qs 1000: 0.00s 212.98%
rs 10000: 0.02s
qs 10000: 0.05s 291.28%
rs 100000: 0.19s
qs 100000: 0.58s 311.98%
rs 1000000: 2.47s
qs 1000000: 7.07s 286.33%
rs 10000000: 31.74s
qs 10000000: 86.04s 271.08%

Edit: Just to clarify. The quicksort implementation here is very memory friendly: it sorts in place, so no matter how large the list is, it just shuffles data around rather than copying it. The original radix sort effectively copies the list twice for each digit: once into the smaller bucket lists and then again when you concatenate those lists. Using itertools.chain avoids the second copy, but there's still a lot of memory allocation and deallocation going on. (Also, 'twice' is approximate, as list appending does involve extra copying even if it is amortized O(1), so maybe I should say 'proportional to twice'.)
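
If you want to push the memory point further, one option (just a sketch of the idea, not something I benchmarked here) is to replace the per-digit bucket lists with a counting pass that scatters values into a single reusable output buffer, so each pass allocates almost nothing:

def radixsort_counting(ls, base=10):
    ## LSD radix sort for non-negative integers using one counting pass per
    ## digit; values are scattered into a reusable output buffer instead of
    ## ten freshly allocated bucket lists per pass
    if not ls:
        return ls
    out = [0] * len(ls)
    divisor = 1
    maxval = max(ls)
    while divisor <= maxval:
        counts = [0] * base
        for n in ls:                          ## count keys per bucket
            counts[(n // divisor) % base] += 1
        total = 0
        for d in range(base):                 ## turn counts into starting offsets
            counts[d], total = total, total + counts[d]
        for n in ls:                          ## stable scatter into the buffer
            d = (n // divisor) % base
            out[counts[d]] = n
            counts[d] += 1
        ls, out = out, ls                     ## swap buffers for the next pass
        divisor *= base
    return ls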

OTHER TIPS

Your data representation is very expensive. Why do you use a hashmap for your buckets? And why use a base-10 representation, for which you need to compute logarithms (expensive to compute)?
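
As a concrete example of that suggestion (my own sketch, not benchmarked against the code above): a byte-wise LSD pass needs only a shift and a mask, uses plain lists as buckets, and the number of passes comes from bit_length() instead of a logarithm:

def splitmerge_bytes(ls, shift):
    ## one LSD pass over 8 bits: 256 list buckets, shift/mask instead of
    ## division and modulo by powers of ten
    buckets = [[] for _ in range(256)]
    for n in ls:
        buckets[(n >> shift) & 0xFF].append(n)
    return [n for bucket in buckets for n in bucket]

def radixsort256(ls):
    ## assumes a non-empty list of non-negative integers
    passes = (max(ls).bit_length() + 7) // 8
    for p in range(passes):
        ls = splitmerge_bytes(ls, 8 * p)
    return ls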

Avoid lambda expressions and the like; I don't think Python can optimize them very well yet.

Maybe start by sorting 10-byte strings for the benchmark instead. And: no hashmaps or similarly expensive data structures.
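
If it helps to see what sorting fixed-length strings looks like in that style, here is a small sketch (my interpretation, assuming 10-byte keys) of an MSD pass that buckets by one character position and recurses:

def msd_string_sort(keys, pos=0, width=10):
    ## recursive MSD radix sort on fixed-width byte strings: bucket by the
    ## character at position pos, then recurse on each bucket
    if len(keys) < 2 or pos >= width:
        return keys
    buckets = [[] for _ in range(256)]
    for k in keys:
        buckets[ord(k[pos])].append(k)
    result = []
    for bucket in buckets:
        result.extend(msd_string_sort(bucket, pos + 1, width))
    return result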

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow