Question

I have several logfiles that I would like to read. Without loss of generality, let's say the logfile processing is done as follows:

def process(infilepath):
    answer = 0
    with open(infilepath) as infile:
        for line in infile:
            if line.startswith(someStr):  # someStr is the prefix of interest, defined elsewhere
                answer += 1
    return answer

Since I have a lot of logfiles, I wanted to throw multiprocessing at this problem (my first mistake: I should have probably used multi-threading; someone please tell me why)
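For concreteness, the multiprocessing version I have in mind is roughly the following (only a sketch; the file list is a placeholder and process() is the function above):

from multiprocessing import Pool

logfile_paths = ["a.log", "b.log", "c.log"]   # placeholder paths

if __name__ == "__main__":
    with Pool() as pool:                      # one worker per core by default
        counts = pool.map(process, logfile_paths)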

While doing so, it occurred to me that any form of parallel processing should be effectively useless here, since I'm constrained by the fact that there is only one read head on my HDD, and therefore, only one file may be read at a time. In fact, under this reasoning, due to the fact that lines from different files may be requested simultaneously, the read head may need to move significantly from time to time, causing the multiproc approach to be slower than a serial approach. So I decided to go back to a single process to read my logfiles.

Interestingly though, I noticed that I did get a speedup with small files (<= 40KB), and that it was only with large files (>= 445MB) that the expected slow-down was noticed.

This leads me to believe that Python may read files in chunks whose size exceeds the single line I request at a time.

Q1: So what is the file-reading mechanism under the hood?

Q2: What is the best way to optimize the reading of files from a conventional HDD?

Technical specs:

  • python3.3
  • 5400rpm conventional HDD
  • Mac OSX 10.9.2 (Mavericks)

Solution

The observed behavior is a result of:

  1. BufferedIO
  2. a scheduling algorithm that decides the order in which the requisite sectors of the HDD are read

BufferedIO

Depending on the OS and the read block size, it is possible for the entire file to fit into one block, which is then fetched in a single read command. This is why the smaller files are read so easily.
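A quick way to see Python's buffering layer from the interpreter (the filename is a placeholder; the exact default buffer size depends on the platform):

import io

print(io.DEFAULT_BUFFER_SIZE)   # block size used by buffered reads, typically 8192 bytes

with open("some.log", "rb") as f:
    print(type(f))              # <class '_io.BufferedReader'> -- reads go through this buffer
    print(type(f.raw))          # <class '_io.FileIO'> -- the unbuffered file underneath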

Scheduling Algorithm

Larger files (filesize > read block size) have to be read in block-size chunks. Thus, when a read is requested on each of several files (due to the multiprocessing), the needle has to move to different sectors of the HDD (corresponding to where the files live). This repetitive movement does two things:

  1. increases the time between successive reads on the same file
  2. throws off the read-sector predictor, as a file may span multiple sectors

The time between successive reads of the same file matters: if the computation performed on a chunk of lines completes before the read head can provide the next chunk of lines from the same file, the process simply waits until another chunk of lines becomes available. This is one source of slowdowns.

Throwing off the read-predictor is bad for pretty much the same reasons as why throwing off the branch predictor is bad.

With the combined effect of these two issues, processing many large files in parallel would be slower than processing them serially. Of course, this is especially true when processing blockSize many lines finishes before numProcesses * blockSize many lines can be read from the HDD.
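One way to act on this is to split the work by file size: farm the small files out to a pool and read the large ones serially. This is only a sketch; the 1 MB cutoff is illustrative, and it reuses process() from the question:

import os
from multiprocessing import Pool

def process_all(filepaths, size_cutoff=1 << 20):   # 1 MB cutoff is illustrative, not measured
    small = [p for p in filepaths if os.path.getsize(p) <= size_cutoff]
    large = [p for p in filepaths if os.path.getsize(p) > size_cutoff]

    results = {}
    # small files likely fit in a single read block, so parallel workers barely move the head
    if small:
        with Pool() as pool:
            results.update(zip(small, pool.map(process, small)))
    # large files are read one at a time to keep the HDD access pattern sequential
    for path in large:
        results[path] = process(path)
    return results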

Other tips

Another idea would be to profile your code:

try:
    import cProfile as profile
except ImportError:
    import profile

profile.run("process('some.log')")   # 'some.log' is a placeholder for one of your logfile paths
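To make the output easier to digest, the stats can be saved to a file and sorted with pstats (reusing the profile alias imported above; the paths are placeholders):

import pstats

profile.run("process('some.log')", "process.prof")   # write raw stats to a file
stats = pstats.Stats("process.prof")
stats.sort_stats("cumulative").print_stats(10)       # show the 10 most expensive calls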

Here is an example of using a memory-mapped file:

import mmap

with open("hello.txt", "r+b") as f:
    mapf = mmap.mmap(f.fileno(), 0)   # map the whole file into memory
    print(mapf.readline())            # note: mmap returns bytes, not str
    mapf.close()
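Applied to the process() function from the question, the same idea looks roughly like this (a sketch; mmap yields bytes, so the prefix to match has to be bytes as well, and process_mmap/prefix are hypothetical names):

import mmap

def process_mmap(infilepath, prefix=b"someStr"):   # prefix stands in for a bytes version of someStr
    answer = 0
    with open(infilepath, "rb") as infile:
        mapf = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in iter(mapf.readline, b""):  # readline() returns b"" at end of file
                if line.startswith(prefix):
                    answer += 1
        finally:
            mapf.close()
    return answer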
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow