Question

I have several logfiles that I would like to read. Without loss of generality, let's say the logfile processing is done as follows:

def process(infilepath):
    answer = 0
    with open(infilepath) as infile:
        for line in infile:
            if line.startswith(someStr):  # someStr is the prefix of interest, defined elsewhere
                answer += 1
    return answer

Since I have a lot of logfiles, I wanted to throw multiprocessing at this problem (my first mistake: I should have probably used multi-threading; someone please tell me why)
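For concreteness, the multiprocessing version I have in mind is roughly the following (only a sketch; the file list is a placeholder and process() is the function above):

from multiprocessing import Pool

logfile_paths = ["a.log", "b.log", "c.log"]   # placeholder paths

if __name__ == "__main__":
    with Pool() as pool:                      # one worker per core by default
        counts = pool.map(process, logfile_paths)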

While doing so, it occurred to me that any form of parallel processing should be effectively useless here, since I'm constrained by the fact that there is only one read head on my HDD, and therefore, only one file may be read at a time. In fact, under this reasoning, due to the fact that lines from different files may be requested simultaneously, the read head may need to move significantly from time to time, causing the multiproc approach to be slower than a serial approach. So I decided to go back to a single process to read my logfiles.

Interestingly though, I noticed that I did get a speedup with small files (<= 40KB), and that it was only with large files (>= 445MB) that the expected slow-down was noticed.

This leads me to believe that Python may read files in chunks whose size exceeds the single line I request at a time.

Q1: So what is the file-reading mechanism under the hood?

Q2: What is the best way to optimize the reading of files from a conventional HDD?

Technical specs:

  • python3.3
  • 5400rpm conventional HDD
  • Mac OSX 10.9.2 (Mavericks)

Solution

The observed behavior is a result of:

  1. BufferedIO
  2. a scheduling algorithm that decides the order in which the requisite sectors of the HDD are read

BufferedIO

Depending on the OS and the read block size, it is possible for the entire file to fit into one block, which is then fetched in a single read command. This is why the smaller files are read so easily.
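A quick way to see Python's buffering layer from the interpreter (the filename is a placeholder; the exact default buffer size depends on the platform):

import io

print(io.DEFAULT_BUFFER_SIZE)   # block size used by buffered reads, typically 8192 bytes

with open("some.log", "rb") as f:
    print(type(f))              # <class '_io.BufferedReader'> -- reads go through this buffer
    print(type(f.raw))          # <class '_io.FileIO'> -- the unbuffered file underneath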

Scheduling Algorithm

Larger files (filesize > read block size) have to be read in block-size chunks. Thus, when a read is requested on each of several files (due to the multiprocessing), the needle has to move to different sectors of the HDD (corresponding to where the files live). This repetitive movement does two things:

  1. increases the time between successive reads on the same file
  2. throws off the read-sector predictor, as a file may span multiple sectors

The time between successive reads of the same file matters: if the computation performed on a chunk of lines completes before the read head can provide the next chunk of lines from the same file, the process simply waits until another chunk of lines becomes available. This is one source of slowdowns.

Throwing off the read-predictor is bad for pretty much the same reasons as why throwing off the branch predictor is bad.

With the combined effect of these two issues, processing many large files in parallel would be slower than processing them serially. Of course, this is especially true when processing blockSize many lines finishes before numProcesses * blockSize many lines can be read from the HDD.
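One way to act on this is to split the work by file size: farm the small files out to a pool and read the large ones serially. This is only a sketch; the 1 MB cutoff is illustrative, and it reuses process() from the question:

import os
from multiprocessing import Pool

def process_all(filepaths, size_cutoff=1 << 20):   # 1 MB cutoff is illustrative, not measured
    small = [p for p in filepaths if os.path.getsize(p) <= size_cutoff]
    large = [p for p in filepaths if os.path.getsize(p) > size_cutoff]

    results = {}
    # small files likely fit in a single read block, so parallel workers barely move the head
    if small:
        with Pool() as pool:
            results.update(zip(small, pool.map(process, small)))
    # large files are read one at a time to keep the HDD access pattern sequential
    for path in large:
        results[path] = process(path)
    return results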

Other tips

Another idea would be to profile your code:

try:
    import cProfile as profile
except ImportError:
    import profile

profile.run("process('some.log')")   # 'some.log' is a placeholder for one of your logfile paths
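To make the output easier to digest, the stats can be saved to a file and sorted with pstats (reusing the profile alias imported above; the paths are placeholders):

import pstats

profile.run("process('some.log')", "process.prof")   # write raw stats to a file
stats = pstats.Stats("process.prof")
stats.sort_stats("cumulative").print_stats(10)       # show the 10 most expensive calls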

Here is an example of using a memory-mapped file:

import mmap

with open("hello.txt", "r+b") as f:
    mapf = mmap.mmap(f.fileno(), 0)   # map the whole file into memory
    print(mapf.readline())            # note: mmap returns bytes, not str
    mapf.close()
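Applied to the process() function from the question, the same idea looks roughly like this (a sketch; mmap yields bytes, so the prefix to match has to be bytes as well, and process_mmap/prefix are hypothetical names):

import mmap

def process_mmap(infilepath, prefix=b"someStr"):   # prefix stands in for a bytes version of someStr
    answer = 0
    with open(infilepath, "rb") as infile:
        mapf = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in iter(mapf.readline, b""):  # readline() returns b"" at end of file
                if line.startswith(prefix):
                    answer += 1
        finally:
            mapf.close()
    return answer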
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow