The observed behavior is a result of:
- buffered I/O
- a scheduling algorithm that decides the order in which the requisite sectors of the HDD are read
Buffered I/O
Depending on the OS and the read block size, the entire file may fit into a single block, in which case it is fetched by a single read command. This is why the smaller files are read so quickly.
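The effect of the buffer can be made visible from Python by counting how often the buffered layer actually touches the underlying raw file. This is a sketch under assumptions: `CountingFileIO` is a name I made up, and the 8 KiB default buffer size is a CPython detail, not a guarantee of the OS block size.

```python
import io
import os
import tempfile

class CountingFileIO(io.FileIO):
    """Raw (unbuffered) file object that counts reads against the OS."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read_calls = 0

    def readinto(self, b):
        self.read_calls += 1
        return super().readinto(b)

# A file comfortably smaller than the default buffer (8 KiB on CPython).
path = os.path.join(tempfile.mkdtemp(), "small.txt")
with open(path, "wb") as f:
    f.write(b"x" * 1024)

raw = CountingFileIO(path, "rb")
with io.BufferedReader(raw) as buffered:  # buffer_size defaults to io.DEFAULT_BUFFER_SIZE
    data = buffered.read(1024)

# The whole 1 KiB file arrives via one underlying read into the buffer.
print(len(data), raw.read_calls)
```

A file larger than the buffer would instead show one underlying read per block-size chunk consumed.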
Scheduling Algorithm
Larger files (file size > read block size) have to be read in block-size chunks. Thus, when a read is requested on each of several files (due to the multiprocessing), the read head has to move between the different sectors of the HDD where those files live. This repetitive movement does two things:
- it increases the time between successive reads of the same file
- it throws off the read-sector predictor, since a file may span multiple sectors
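The access pattern several concurrent readers induce can be sketched in Python as a single loop that pulls one block from each file in turn. `interleaved_read` and the block size are illustrative names of mine; on an SSD, or when the OS page cache already holds the files, the seek penalty this models largely disappears.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed read block size

def interleaved_read(paths, block_size=BLOCK_SIZE):
    """Read one block from each file in turn, mimicking concurrent readers:
    on an HDD the head must seek between the sectors backing each file on
    every pass through the loop."""
    files = [open(p, "rb") for p in paths]
    totals = {p: 0 for p in paths}
    try:
        while files:
            for f in list(files):
                chunk = f.read(block_size)
                if not chunk:          # this file is exhausted
                    files.remove(f)
                    f.close()
                else:
                    totals[f.name] += len(chunk)
    finally:
        for f in files:
            f.close()
    return totals

# Three files larger than one block each, so their reads interleave.
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, f"big{i}.bin")
    with open(p, "wb") as f:
        f.write(b"a" * (4 * BLOCK_SIZE))
    paths.append(p)

totals = interleaved_read(paths)
```

Reading the same three files one after another would let the head stream each file front to back instead of seeking on every block.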
The time between successive reads of the same file matters: if the computation performed on one chunk of lines finishes before the read head can deliver the next chunk from the same file, the process simply waits until that chunk becomes available. This is one source of slowdown.
Throwing off the read-sector predictor is bad for much the same reason that throwing off the branch predictor is bad: the drive's prefetching no longer pays off.
With the combined effect of these two issues, processing many large files in parallel can be slower than processing them serially. This is especially true when processing `blockSize` many lines finishes before `numProcesses * blockSize` many lines can be read off the HDD.