Median of medians algorithm: why divide the array into blocks of size 5

Question 1

I think that if you'll check "Proof of O(n) running time" section of wiki page for medians-of-medians algorithm:

The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time

The O(n) term c n is for the partitioning work (we visited each element a constant number of times, in order to form them into n/5 groups and take each median in O(1) time). From this, using induction, one can easily show that

That should help you to understand, why.

Question 2

The number has to be larger than 3 (and an odd number, obviously) for the algorithm. 5 is the smallest odd number larger than 3. So 5 was chosen.

Question 3

You can also use blocks of size 3 or 4, as shown in the paper Select with groups of 3 or 4 by K. Chen and A. Dumitrescu (2015). The idea is to use the "median of medians" algorithm twice and partition only after that. This lowers the quality of the pivot but is faster.

So instead of:

T(n) <= T(n/3) + T(2n/3) + O(n)
T(n) = O(nlogn)

one gets:

T(n) <= T(n/9) + T(7n/9) + O(n)
T(n) = Theta(n)

Question 4

See this explanation on Brilliant.org. Basically, five is the smallest possible array we can use to maintain linear time. It is also easy to implement a linear sort with an n=5 sized array. Apologies for the laTex:

Why 5?

The median-of-medians divides a list into sublists of length five to get an optimal running time. Remember, finding the median of small lists by brute force (sorting) takes a small amount of time, so the length of the sublists must be fairly small. However, adjusting the sublist size to three, for example, does change the running time for the worse.

If the algorithm divided the list into sublists of length three, pp would be greater than approximately \frac{n}{3} 3 n elements and it would be smaller than approximately \frac{n}{3} 3 n elements. This would cause a worst case \frac{2n}{3} 3 2n recursions, yielding the recurrence T(n) = T\big( \frac{n}{3}\big) + T\big(\frac{2n}{3}\big) + O(n),T(n)=T( 3 n )+T( 3 2n )+O(n), which by the master theorem is O(n \log n),O(nlogn), which is slower than linear time.

In fact, for any recurrence of the form T(n) \leq T(an) + T(bn) + cnT(n)≤T(an)+T(bn)+cn, if a + b < 1a+b<1, the recurrence will solve to O(n)O(n), and if a+b > 1a+b>1, the recurrence is usually equal to \Omega(n \log n)Ω(nlogn). [3]

The median-of-medians algorithm could use a sublist size greater than 5—for example, 7—and maintain a linear running time. However, we need to keep the sublist size as small as we can so that sorting the sublists can be done in what is effectively constant time.