Question

I just read the Wikipedia page about bucket sort. The article says that the worst-case complexity is O(n²), but I thought the worst case was O(n + k), where k is the number of buckets. This is how I calculate that complexity:

  1. Adding an element to a bucket: O(1) using a linked list
  2. Going through the input and putting each of the n elements in the correct bucket: n insertions at O(1) each = O(n)
  3. Merging the buckets: O(k)
  4. Total: O(1) * O(n) + O(k) = O(n + k)
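
In code, my reasoning looks roughly like this minimal Python sketch (bucket_index is just a placeholder for whatever function maps an element to its bucket):

  def bucket_sort(arr, k, bucket_index):
      # Steps 1-2: one pass over the input; each append is O(1),
      # so distributing all n elements costs O(n).
      buckets = [[] for _ in range(k)]
      for x in arr:
          buckets[bucket_index(x)].append(x)
      # Step 3: merging the k buckets costs O(k) plus the copying.
      result = []
      for b in buckets:
          result.extend(b)
      return result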

Am I missing something?


Solution

What if the algorithm decides that every element belongs in the same bucket? In that case, the linked list in that bucket needs to be traversed every time an element is added, for example to insert the new element in sorted order. That takes 1 step, then 2, then 3, 4, 5, ..., n. The total time is thus the sum of the numbers from 1 to n, which is (n² + n)/2, and that is O(n²).

Of course, this is the worst case (all of the elements in one bucket); the function that calculates which bucket an element belongs in is generally designed to avoid this behavior.
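
To see the cost concretely, here is a small Python sketch (the names are mine, not from the answer above) that keeps a bucket sorted as elements are inserted and counts the comparisons; when every element lands in the same bucket, the count grows quadratically:

  def insert_sorted(bucket, x):
      # Walk the sorted bucket to find x's position: 1 step, then 2, then 3...
      comparisons = 0
      i = 0
      while i < len(bucket) and bucket[i] < x:
          comparisons += 1
          i += 1
      bucket.insert(i, x)
      return comparisons

  # Degenerate case: all n elements map to one bucket.
  bucket, total = [], 0
  for x in range(1000):   # ascending input makes every insert walk the whole list
      total += insert_sorted(bucket, x)
  print(total)            # 499500 = 999 * 1000 / 2, i.e. O(n^2) growth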

OTHER TIPS

In order to merge the buckets, they first need to be sorted. Consider the pseudocode given in the Wikipedia article:

function bucketSort(array, n) is
  buckets ← new array of n empty lists
  for i = 0 to (length(array)-1) do
    insert array[i] into buckets[msbits(array[i], k)]
  for i = 0 to n - 1 do
    nextSort(buckets[i])
  return the concatenation of buckets[0], ..., buckets[n-1]

The nextSort(buckets[i]) call sorts each of the individual buckets (and msbits(x, k), presumably, returns the k most significant bits of x, so each bucket receives a contiguous range of keys). Generally, a different sort is used for the buckets (e.g., insertion sort), since once the lists get small enough, simple non-recursive sorts often give better performance.
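
For reference, here is a minimal runnable Python version of that pseudocode. It assumes the keys are floats in [0, 1), so the bucket index is simply floor(n * x) instead of msbits, and it uses insertion sort as nextSort:

  def insertion_sort(a):
      # Simple non-recursive sort; efficient for small per-bucket lists.
      for i in range(1, len(a)):
          x, j = a[i], i - 1
          while j >= 0 and a[j] > x:
              a[j + 1] = a[j]
              j -= 1
          a[j + 1] = x

  def bucket_sort(array):
      n = len(array)
      buckets = [[] for _ in range(n)]
      for x in array:                    # scatter pass: O(n)
          buckets[int(n * x)].append(x)  # assumes 0 <= x < 1
      for b in buckets:                  # nextSort on each bucket
          insertion_sort(b)
      return [x for b in buckets for x in b]  # concatenation

  print(bucket_sort([0.42, 0.32, 0.33, 0.52, 0.37, 0.47, 0.51]))
  # [0.32, 0.33, 0.37, 0.42, 0.47, 0.51, 0.52]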

Now, consider the case where all n elements end up in the same bucket. If we use insertion sort on the individual buckets, this leads to the worst-case performance of O(n²). So the worst-case complexity depends on the sort you choose for the individual buckets.

If you can guarantee that each bucket holds only one distinct value (i.e., all items that share a bucket are equal), then the worst-case time complexity would be O(n + k), as you pointed out.

Bucket sort assumes that the input is drawn from a uniform distribution. This implies that only a few items fall into each bucket, which in turn leads to a nice average running time of O(n). Indeed, if the n elements are distributed so that O(1) of them fall into each bucket (each insertion requiring O(1) per item), then sorting a bucket with insertion sort also requires O(1) on average (this is proved in almost all textbooks on algorithms). Since you must sort n buckets, the average complexity is O(n).

Now, assume that the input is not drawn from a uniform distribution. As @mfrankli already pointed out, this may lead in the worst case to a situation in which all of the items fall into a single bucket (say, the first one). In this case, insertion sort requires O(n²) in the worst case.

Note that you may use the following trick to keep the same O(n) average complexity while guaranteeing O(n log n) complexity in the worst case: instead of insertion sort, simply use an algorithm with O(n log n) worst-case complexity, such as merge sort or heap sort (but not quicksort, which achieves O(n log n) only on average).
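
In the Python sketch earlier, that is a one-line change to nextSort. For example, a heap sort built on Python's heapq (the built-in list.sort, Timsort, also has an O(n log n) worst case):

  import heapq

  def heap_sort(a):
      # Heap sort: O(m log m) worst case for a bucket of m items.
      heapq.heapify(a)                                  # O(m)
      return [heapq.heappop(a) for _ in range(len(a))]  # m pops, O(log m) each

Since this returns a new list, the per-bucket loop becomes buckets[i] = heap_sort(buckets[i]); the all-in-one-bucket case then costs O(n log n), while uniformly distributed input still averages O(n).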

This is an add-on answer to @perreal; I tried to post it as a comment, but it was too long. @perreal correctly points out when bucket sort makes the most sense. The different answers here are making different assumptions about the data being sorted. For example, if the keys are strings, the range of possible keys is too large (larger than the bucket array), so we have to bucket on only the first character of the string, or use some similar strategy. The individual buckets then hold items with different keys and must themselves be sorted, which is what leads to O(n²).

But if we are sorting data whose keys are integers in a known range, then the buckets are always already sorted, because all the keys in a given bucket are equal; this yields a linear-time sort. Not only are the buckets sorted, but the sort is stable, because we can pull items out of the bucket array in the order they were added.

What I wanted to add is that if you are facing O(n²) because of the nature of the keys to be sorted, bucket sort might not be the right approach. But when the range of possible keys is proportional to the size of the input, you can take advantage of the linear-time bucket sort by having each bucket hold only one value of a key, as in the sketch below.
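
A minimal sketch of that idea (essentially pigeonhole sort), assuming integer keys in a known range [0, k): with one bucket per key value, the buckets never need sorting, and equal-keyed items keep their relative order:

  def pigeonhole_sort(items, k, key):
      # Stable O(n + k) sort for items whose key is an int in [0, k).
      buckets = [[] for _ in range(k)]
      for item in items:                 # one O(n) scatter pass
          buckets[key(item)].append(item)
      # Every bucket holds a single key value, so concatenating the
      # buckets in index order yields the sorted, stable result.
      return [item for b in buckets for item in b]

  data = [(3, 'a'), (1, 'b'), (3, 'c'), (0, 'd')]
  print(pigeonhole_sort(data, 4, key=lambda t: t[0]))
  # [(0, 'd'), (1, 'b'), (3, 'a'), (3, 'c')] -- equal keys keep their order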

Licensed under: CC-BY-SA with attribution