Question

There don't seem to be any preexisting questions on this, at least from a title search. I am seeking to find the optimal number of passes for an external merge. So, if we have 1000 chunks of data, one pass would be a 1000-way merge. Two passes could be 5 groups of 200 chunks, followed by a final merge of 1 group of 5 chunks. And so on. I've done some math, which must have a flaw, because it looks like two passes never beat one pass. It could very well be a misunderstanding of how data is read, though.

First, a numerical example:

Data: 100 GB
Ram: 1 GB

Since we have 1 GB of memory, we can load 1 GB at a time and sort it with quicksort or mergesort. Now we have 100 sorted chunks, and we can do a 100-way merge. This is done by making RAM/(chunks+1)-sized buckets = 1024 MB/101 = 10.14 MB. There are 100 input buckets of 10.14 MB, one for each of the 100 chunks, and one output bucket, also of size 10.14 MB. As we merge, whenever an input bucket empties, we do a disk seek to refill it. Likewise, when the output bucket gets full, we write it to disk and empty it. I claim that the number of "times the disk needs to read" is (data/ram)*(chunks+1). I get this from the fact that we have ram/(chunks+1)-sized input buckets, and we must read in the entire data for a given pass, so we read (data/bucket_size) times. In other words, every time an input bucket empties we must refill it. We do this over 100 chunks here, so numChunks*(chunk_size/bucket_size) = datasize/bucket_size, or 100*(1024 MB/10.14 MB). Since bucket_size = ram/(chunks+1), this is (data/ram)*(chunks+1) = (100*1024 MB/1024 MB)*101 = 10100 reads.
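Here's a quick Python sketch of that arithmetic, counting nothing but buffer refills (the same simplification as above, so it shares whatever assumptions the formula makes):

# One-pass read count for the numerical example above.
# Only buffer refills are counted; writes and caching are ignored.
data_mb = 100 * 1024            # 100 GB of data
ram_mb = 1024                   # 1 GB of RAM

chunks = data_mb // ram_mb                  # 100 sorted chunks
bucket_mb = ram_mb / (chunks + 1)           # ~10.14 MB per bucket
reads = data_mb / bucket_mb                 # = (data/ram) * (chunks + 1)

print(chunks, round(bucket_mb, 2), round(reads))    # 100 10.14 10100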

For a two-pass system, we do A groups of B chunks each, then a final merge of 1 group of A chunks. Using the previous logic, we have numReads = A*( (data/ram)*(B+1) ) + 1*( (data/ram)*(A+1) ). We also have A*B = Data/Ram. For instance, 10 groups of 10 chunks, where each chunk is a GB. Here A = 10 and B = 10, and 10*10 = 100 GB/1 GB = 100, which is Data/Ram. This works because Data/Ram was the original number of chunks. For 2 passes, we want to break Data/Ram into A groups of B chunks each.

I'll try to break down the formula here. Let D = data, A = #groups, B = #chunks/group, and R = RAM.

A*(D/R)*(B+1) + 1*(D/R)*(A+1) - This is A times the number of reads of an external merge on B #chunks plus the final merge on A #chunks.

A = D/(R*B) => D^2/(B*R^2) * (B+1) + D/R * (D/(R*B)+1)

(D^2/R^2)*[1 + 2/B] + D/R is the number of reads for a 2-pass external merge. For 1 pass, we have (data/ram)*(chunks+1), where chunks = data/ram for 1 pass. Thus, for one pass we have D^2/R^2 + D/R. We see that the 2-pass count only approaches that as the group size B goes to infinity, and even in that limit the additional final merge still leaves us at D^2/R^2 + D/R. So there must be something about the reads I'm missing, or my math is flawed. Thanks to anyone who takes the time to help me!
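Edit: here's the same comparison as a short Python script, in case it helps. It evaluates the formulas exactly as derived above, so it inherits whatever flaw the model has:

# Read counts under the model above: one pass vs. two passes.
# D = data, R = RAM (same units); B = chunks per group, A = D/(R*B) groups.
def one_pass_reads(D, R):
    chunks = D / R
    return (D / R) * (chunks + 1)               # D^2/R^2 + D/R

def two_pass_reads(D, R, B):
    A = D / (R * B)
    return A * (D / R) * (B + 1) + (D / R) * (A + 1)

D, R = 100.0, 1.0                               # 100 GB data, 1 GB RAM
print(one_pass_reads(D, R))                     # 10100.0
for B in (2, 10, 50, 100):
    print(B, two_pass_reads(D, R, B))           # never drops below 10100 here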


Solution

You ignore the fact that the total time it takes to read a block of data from disk is the sum of

  • The access time, which is roughly constant and on the order of several milliseconds for rotating hard disk drives.
  • The transfer time, which depends on the size of the data block and the transfer rate.

As the number of chunks increases, the size of the input buffers (you call them buckets) decreases. The smaller the input buffers get, the more pronounced the effect of the constant access time becomes on the total time it takes to fill a buffer. At a certain point, the time to fill a buffer is almost completely dominated by the access time, so the total time for a merge pass begins to scale with the number of buffer reads rather than with the amount of data read.

That's where additional merge passes can speed up the process: they allow you to use fewer, larger input buffers, which mitigates the effect of the access time.
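A quick way to see the effect, using the same 10 ms access time and 100 MB/s transfer rate as in the calculation below, is to look at how the time to fill a single buffer splits between seeking and transferring as the buffer shrinks:

# Time to fill one input buffer = access_time + buffer_size / transfer_rate.
access_time = 0.010        # 10 ms per read
transfer_rate = 100.0      # MB/s

for buffer_mb in (100.0, 10.0, 1.0):
    fill_time = access_time + buffer_mb / transfer_rate
    seek_share = 100.0 * access_time / fill_time
    print(buffer_mb, "MB buffer:", round(fill_time, 3), "s per fill,",
          round(seek_share, 1), "% of that is seeking")

With 100 MB buffers the seek is about 1% of the fill time; at 1 MB it is already half.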

Edit: Here's a quick back-of-the-envelope calculation to give an idea about where the break-even point is.

The total transfer time can be calculated easily. All the data has to be read and written once per pass:

total_transfer_time = num_passes * 2 * data / transfer_rate

The total access time for buffer reads is:

total_access_time = num_passes * num_buffer_reads * access_time

Since there's only a single output buffer, it can be made larger than the input buffers without wasting too much memory, so I'll ignore the access time for writes. The number of buffer reads is data / buffer_size, buffer size is about ram / num_chunks for the one-pass approach, and the number of chunks is data / ram. So we have:

total_access_time1 = num_chunks^2 * access_time

For the two-pass solution, it makes sense to use sqrt(num_chunks) buffers to minimize access time. So buffer size is ram / sqrt(num_chunks) and we have:

total_access_time2 = 2 * (data / (ram / sqrt(num_chunks))) * access_time
                   = 2 * num_chunks^1.5 * access_time

So if we use transfer_rate = 100 MB/s, access_time = 10 ms, data = 100 GB, ram = 1 GB, the total time is:

total_time1 = (2 * 100 GB / 100 MB/s) + 100^2 * 10 ms
            = 2000 s + 100 s = 2100 s
total_time2 = (2 * 2 * 100 GB / 100 MB/s) + 2 * 100^1.5 * 10 ms
            = 4000 s + 20 s = 4020 s

The effect of access time is still very small. So let's change data to 1000 GB:

total_time1 = (2 * 1000 GB / 100 MB/s) + 1000^2 * 10 ms
            = 20000 s + 10000 s = 30000 s
total_time2 = (2 * 2 * 1000 GB / 100 MB/s) + 2 * 1000^1.5 * 10 ms
            = 40000 s + 632 s = 40632 s

Now a third of the time in the one-pass version is spent on disk seeks. Let's try with 5000 GB:

total_time1 = (2 * 5000 GB / 100 MB/s) + 5000^2 * 10 ms
            = 100000 s + 250000 s = 350000 s
total_time2 = (2 * 2 * 5000 GB / 100 MB/s) + 2 * 5000^1.5 * 10 ms
            = 200000 s + 7071 s = 207071 s

Now the two-pass version is faster.
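That whole back-of-the-envelope model fits in a few lines of Python, which makes it easy to play with other parameters (the same simplifications apply: writes cost only transfer time, and write seeks are ignored):

# Back-of-the-envelope model from above, with data and RAM in GB.
ACCESS_TIME = 0.010        # 10 ms per buffer read
TRANSFER_RATE = 0.1        # GB/s (= 100 MB/s)
RAM_GB = 1.0

def total_time1(data_gb):                      # one pass
    chunks = data_gb / RAM_GB
    return 1 * 2 * data_gb / TRANSFER_RATE + chunks ** 2 * ACCESS_TIME

def total_time2(data_gb):                      # two passes
    chunks = data_gb / RAM_GB
    return 2 * 2 * data_gb / TRANSFER_RATE + 2 * chunks ** 1.5 * ACCESS_TIME

for data_gb in (100, 1000, 5000):
    print(data_gb, "GB:",
          round(total_time1(data_gb)), "s one-pass,",
          round(total_time2(data_gb)), "s two-pass")

This reproduces the three cases above (2100 s vs. 4020 s, 30000 s vs. 40632 s, 350000 s vs. 207071 s).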

Other tips

To get an optimum you need a more sophisticated model of the disk. Let the time to fill a block of size S be rS + k, where k is the seek time and r is the read time per unit of data (the inverse of the transfer rate).

If you divide RAM of size M into C+1 buffers of size M/(C+1), then the time to load RAM once is (C+1) (r M/(C+1) + k) = rM + k(C+1). So as you'd expect, making C smaller speeds up read time by eliminating seeks. It's fastest to read all of memory in one sequential block, but merging doesn't allow it. We must make a tradeoff. That's where we need to look for the optimum.

With total data size of c times RAM size, there are c chunks to be merged.

In the one pass scheme, C=c, and the total read time must be just the time to fill RAM c times over: c (rM + k(c+1)) = c(rM + kc + k).

In the two pass scheme with an N-way division of data for the first pass, that pass has C=c/N and in the second pass, C=N. So total cost is

c ( rM + k(c/N+1) ) + c ( rM + k(N+1) ) = c ( 2rM + k(c/N + N) + 2k )

Note this model omits write time. You should fill that in eventually unless you're assuming it's overlapped I/O on a different device and thus can be ignored.

It's not hard to see here that if c and k are suitably large, then the c/N+N term in the 2-pass model can be so small compared to the c in the one-pass that the 2-pass model will be faster.

I'm going to stop now, but you can carry this logic on to (probably) get a closed approximation formula for an arbitrary number of passes. This will require solving an infinite series. Then you can set the derivative to zero and solve for an estimate of the optimal pass number. If life is good, you'll also learn the optimal value of N by setting the gradient of a 2d function in pass number and N to zero. My intuition says N ~ sqrt(c).
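(For the two-pass case that intuition is easy to verify: the only N-dependent part of the cost above is the k(c/N + N) term, and d/dN (c/N + N) = -c/N^2 + 1 = 0 gives N = sqrt(c), so both passes end up merging about sqrt(c) streams.)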

If the math gets intractable, you could still simulate a reasonable range of numbers of passes with the kind of simple algebra above at the start and pick an optimum that way.
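For example, here is a rough Python sketch of that kind of sweep. It assumes the same fan-in on every pass (c^(1/p) for p passes, allowed to be fractional for simplicity), omits write time like the model above, and uses the 5000 GB numbers from the answer above as illustrative parameters:

# Read time for p passes, assuming an equal fan-in of c**(1/p) on every pass.
# Each pass reads all the data once, i.e. fills RAM c times, and each fill
# with (fanin + 1) buffers costs r*M + k*(fanin + 1), as derived above.
def read_time(c, r, M, k, passes):
    fanin = c ** (1.0 / passes)
    return passes * c * (r * M + k * (fanin + 1))

c = 5000           # chunks: 5000 GB of data with 1 GB of RAM
r = 0.01           # seconds per MB (100 MB/s)
M = 1024.0         # RAM in MB
k = 0.010          # 10 ms seek time

for p in range(1, 6):
    print(p, "passes:", round(read_time(c, r, M, k, p)), "s")

With these numbers the sweep picks two passes, in line with the 5000 GB estimate in the answer above.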

This is an interesting problem and I'm sorry I don't have more time to work on it at the moment. I hope the analysis framework is enough to let you punch through to a nice result.

Licensed under: CC-BY-SA with attribution