Question

There don't seem to be any preexisting questions on this, at least from a title search. I am seeking to find the optimal number of passes for an external merge. So, if we have 1000 chunks of data, one pass would be a 1000-way merge. Two passes could be 5 groups of 200 chunks, followed by a final merge of 1 group of 5 chunks. And so on. I've done some math, which must have a flaw, because it looks like two passes never beat one pass. It could very well be a misunderstanding of how data is read, though.

First, a numerical example:

Data: 100 GB
Ram: 1 GB

Since we have 1 GB of memory, we can load 1 GB at a time and sort it with quicksort or mergesort. Now we have 100 sorted chunks, and we can do a 100-way merge. This is done by making RAM/(chunks+1)-sized buckets = 1024 MB/101 = 10.14 MB. There are 100 input buckets of 10.14 MB, one for each of the 100 chunks, and one output bucket, also of size 10.14 MB. As we merge, whenever an input bucket empties, we do a disk seek to refill it. Likewise, when the output bucket gets full, we write it to disk and empty it. I claim that the number of "times the disk needs to read" is (data/ram)*(chunks+1). I get this from the fact that we have ram/(chunks+1)-sized input buckets, and we must read in the entire data for a given pass, so we read (data/bucket_size) times. In other words, every time an input bucket empties we must refill it. We do this over 100 chunks here, so numChunks*(chunk_size/bucket_size) = datasize/bucket_size, or 100*(1024 MB/10.14 MB). Since bucket_size = ram/(chunks+1), this is (data/ram)*(chunks+1) = (100*1024 MB/1024 MB)*101 = 10100 reads.
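Here's a quick Python sketch of that arithmetic, counting nothing but buffer refills (the same simplification as above, so it shares whatever assumptions the formula makes):

# One-pass read count for the numerical example above.
# Only buffer refills are counted; writes and caching are ignored.
data_mb = 100 * 1024            # 100 GB of data
ram_mb = 1024                   # 1 GB of RAM

chunks = data_mb // ram_mb                  # 100 sorted chunks
bucket_mb = ram_mb / (chunks + 1)           # ~10.14 MB per bucket
reads = data_mb / bucket_mb                 # = (data/ram) * (chunks + 1)

print(chunks, round(bucket_mb, 2), round(reads))    # 100 10.14 10100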

For a two-pass system, we do A groups of B chunks each, then a final merge of 1 group of A chunks. Using the previous logic, we have numReads = A*( (data/ram)*(B+1) ) + 1*( (data/ram)*(A+1) ). We also have A*B = Data/Ram. For instance, 10 groups of 10 chunks, where each chunk is a GB. Here A = 10 and B = 10, and 10*10 = 100 GB/1 GB = 100, which is Data/Ram. This works because Data/Ram was the original number of chunks. For 2 passes, we want to break Data/Ram into A groups of B chunks each.

I'll try to break down the formula here. Let D = data, A = #groups, B = #chunks/group, and R = RAM.

A*(D/R)*(B+1) + 1*(D/R)*(A+1) - This is A times the number of reads of an external merge on B #chunks plus the final merge on A #chunks.

A = D/(R*B) => D^2/(B*R^2) * (B+1) + D/R * (D/(R*B)+1)

(D^2/R^2)*[1 + 2/B] + D/R is the number of reads for a 2-pass external merge. For 1 pass, we have (data/ram)*(chunks+1), where chunks = data/ram for 1 pass. Thus, for one pass we have D^2/R^2 + D/R. We see that the 2-pass count only approaches that as the group size B goes to infinity, and even in that limit the additional final merge still leaves us at D^2/R^2 + D/R. So there must be something about the reads I'm missing, or my math is flawed. Thanks to anyone who takes the time to help me!
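Edit: here's the same comparison as a short Python script, in case it helps. It evaluates the formulas exactly as derived above, so it inherits whatever flaw the model has:

# Read counts under the model above: one pass vs. two passes.
# D = data, R = RAM (same units); B = chunks per group, A = D/(R*B) groups.
def one_pass_reads(D, R):
    chunks = D / R
    return (D / R) * (chunks + 1)               # D^2/R^2 + D/R

def two_pass_reads(D, R, B):
    A = D / (R * B)
    return A * (D / R) * (B + 1) + (D / R) * (A + 1)

D, R = 100.0, 1.0                               # 100 GB data, 1 GB RAM
print(one_pass_reads(D, R))                     # 10100.0
for B in (2, 10, 50, 100):
    print(B, two_pass_reads(D, R, B))           # never drops below 10100 here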


Solution

You ignore the fact that the total time it takes to read a block of data from disk is the sum of

  • The access time, which is roughly constant and on the order of several milliseconds for rotating hard disk drives.
  • The transfer time, which depends on the size of the data block and the transfer rate.

As the number of chunks increases, the size of the input buffers (you call them buckets) decreases. The smaller the input buffers get, the more pronounced the effect of the constant access time becomes on the total time it takes to fill a buffer. At a certain point, the time to fill a buffer is almost completely dominated by the access time, so the total time for a merge pass begins to scale with the number of buffer reads rather than with the amount of data read.

That's where additional merge passes can speed up the process: they allow you to use fewer, larger input buffers, which mitigates the effect of the access time.
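A quick way to see the effect, using the same 10 ms access time and 100 MB/s transfer rate as in the calculation below, is to look at how the time to fill a single buffer splits between seeking and transferring as the buffer shrinks:

# Time to fill one input buffer = access_time + buffer_size / transfer_rate.
access_time = 0.010        # 10 ms per read
transfer_rate = 100.0      # MB/s

for buffer_mb in (100.0, 10.0, 1.0):
    fill_time = access_time + buffer_mb / transfer_rate
    seek_share = 100.0 * access_time / fill_time
    print(buffer_mb, "MB buffer:", round(fill_time, 3), "s per fill,",
          round(seek_share, 1), "% of that is seeking")

With 100 MB buffers the seek is about 1% of the fill time; at 1 MB it is already half.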

Edit: Here's a quick back-of-the-envelope calculation to give an idea about where the break-even point is.

The total transfer time can be calculated easily. All the data has to be read and written once per pass:

total_transfer_time = num_passes * 2 * data / transfer_rate

The total access time for buffer reads is:

total_access_time = num_passes * num_buffer_reads * access_time

Since there's only a single output buffer, it can be made larger than the input buffers without wasting too much memory, so I'll ignore the access time for writes. The number of buffer reads is data / buffer_size, buffer size is about ram / num_chunks for the one-pass approach, and the number of chunks is data / ram. So we have:

total_access_time1 = num_chunks^2 * access_time

For the two-pass solution, it makes sense to use sqrt(num_chunks) buffers to minimize access time. So buffer size is ram / sqrt(num_chunks) and we have:

total_access_time2 = 2 * (data / (ram / sqrt(num_chunks))) * access_time
                   = 2 * num_chunks^1.5 * access_time

So if we use transfer_rate = 100 MB/s, access_time = 10 ms, data = 100 GB, ram = 1 GB, the total time is:

total_time1 = (2 * 100 GB / 100 MB/s) + 100^2 * 10 ms
            = 2000 s + 100 s = 2100 s
total_time2 = (2 * 2 * 100 GB / 100 MB/s) + 2 * 100^1.5 * 10 ms
            = 4000 s + 20 s = 4020 s

The effect of access time is still very small. So let's change data to 1000 GB:

total_time1 = (2 * 1000 GB / 100 MB/s) + 1000^2 * 10 ms
            = 20000 s + 10000 s = 30000 s
total_time2 = (2 * 2 * 1000 GB / 100 MB/s) + 2 * 1000^1.5 * 10 ms
            = 40000 s + 632 s = 40632 s

Now a third of the time in the one-pass version is spent on disk seeks. Let's try with 5000 GB:

total_time1 = (2 * 5000 GB / 100 MB/s) + 5000^2 * 10 ms
            = 100000 s + 250000 s = 350000 s
total_time2 = (2 * 2 * 5000 GB / 100 MB/s) + 2 * 5000^1.5 * 10 ms
            = 200000 s + 7071 s = 207071 s

Now the two-pass version is faster.
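That whole back-of-the-envelope model fits in a few lines of Python, which makes it easy to play with other parameters (the same simplifications apply: writes cost only transfer time, and write seeks are ignored):

# Back-of-the-envelope model from above, with data and RAM in GB.
ACCESS_TIME = 0.010        # 10 ms per buffer read
TRANSFER_RATE = 0.1        # GB/s (= 100 MB/s)
RAM_GB = 1.0

def total_time1(data_gb):                      # one pass
    chunks = data_gb / RAM_GB
    return 1 * 2 * data_gb / TRANSFER_RATE + chunks ** 2 * ACCESS_TIME

def total_time2(data_gb):                      # two passes
    chunks = data_gb / RAM_GB
    return 2 * 2 * data_gb / TRANSFER_RATE + 2 * chunks ** 1.5 * ACCESS_TIME

for data_gb in (100, 1000, 5000):
    print(data_gb, "GB:",
          round(total_time1(data_gb)), "s one-pass,",
          round(total_time2(data_gb)), "s two-pass")

This reproduces the three cases above (2100 s vs. 4020 s, 30000 s vs. 40632 s, 350000 s vs. 207071 s).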

Other tips

To get an optimum you need a more sophisticated model of the disk. Let the time to fill a block of size S be rS + k, where k is the seek time and r is the read time per unit of data (the inverse of the transfer rate).

If you divide RAM of size M into C+1 buffers of size M/(C+1), then the time to load RAM once is (C+1) (r M/(C+1) + k) = rM + k(C+1). So as you'd expect, making C smaller speeds up read time by eliminating seeks. It's fastest to read all of memory in one sequential block, but merging doesn't allow it. We must make a tradeoff. That's where we need to look for the optimum.

With total data size of c times RAM size, there are c chunks to be merged.

In the one pass scheme, C=c, and the total read time must be just the time to fill RAM c times over: c (rM + k(c+1)) = c(rM + kc + k).

In the two pass scheme with an N-way division of data for the first pass, that pass has C=c/N and in the second pass, C=N. So total cost is

c ( rM + k(c/N+1) ) + c ( rM + k(N+1) ) = c ( 2rM + k(c/N + N) + 2k )

Note this model omits write time. You should fill that in eventually unless you're assuming it's overlapped I/O on a different device and thus can be ignored.

It's not hard to see here that if c and k are suitably large, then the c/N+N term in the 2-pass model can be so small compared to the c in the one-pass that the 2-pass model will be faster.

I'm going to stop now, but you can carry this logic on to (probably) get a closed approximation formula for an arbitrary number of passes. This will require solving an infinite series. Then you can set the derivative to zero and solve for an estimate of the optimal pass number. If life is good, you'll also learn the optimal value of N by setting the gradient of a 2d function in pass number and N to zero. My intuition says N ~ sqrt(c).
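(For the two-pass case that intuition is easy to verify: the only N-dependent part of the cost above is the k(c/N + N) term, and d/dN (c/N + N) = -c/N^2 + 1 = 0 gives N = sqrt(c), so both passes end up merging about sqrt(c) streams.)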

If the math gets intractable, you could still simulate a reasonable range of numbers of passes with the kind of simple algebra above at the start and pick an optimum that way.
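For example, here is a rough Python sketch of that kind of sweep. It assumes the same fan-in on every pass (c^(1/p) for p passes, allowed to be fractional for simplicity), omits write time like the model above, and uses the 5000 GB numbers from the answer above as illustrative parameters:

# Read time for p passes, assuming an equal fan-in of c**(1/p) on every pass.
# Each pass reads all the data once, i.e. fills RAM c times, and each fill
# with (fanin + 1) buffers costs r*M + k*(fanin + 1), as derived above.
def read_time(c, r, M, k, passes):
    fanin = c ** (1.0 / passes)
    return passes * c * (r * M + k * (fanin + 1))

c = 5000           # chunks: 5000 GB of data with 1 GB of RAM
r = 0.01           # seconds per MB (100 MB/s)
M = 1024.0         # RAM in MB
k = 0.010          # 10 ms seek time

for p in range(1, 6):
    print(p, "passes:", round(read_time(c, r, M, k, p)), "s")

With these numbers the sweep picks two passes, in line with the 5000 GB estimate in the answer above.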

This is an interesting problem and I'm sorry I don't have more time to work on it at the moment. I hope the analysis framework is enough to let you punch through to a nice result.

Licensed under: CC-BY-SA with attribution