I've read that certain Python functions implemented in C, which I assume includes file.read(), can release the GIL while they're working and then reacquire it on completion, and that by doing so they can make use of multiple cores if they're available.

I'm using multiprocessing to parallelize some code. Currently I have three processes: the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child.

Now, if I'm understanding this right, it seems that creating a new process to read the file, as I'm currently doing, is unnecessary and I should just call it in the main process. The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?

So given my function to read and pipe the data to be processed:

def read(file_path, pipe_out):
    # block_size is defined elsewhere in my code.
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:  # empty bytes object signals end of file
                break
            pipe_out.send(block)
    pipe_out.close()
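
For completeness, the consumer end (the checksum child) looks roughly like this; the pipe_in name and the use of hashlib here are stand-ins rather than my exact code:

import hashlib

def checksum(pipe_in):
    # Fold incoming blocks into a running SHA-256.
    hasher = hashlib.sha256()
    while True:
        try:
            block = pipe_in.recv()
        except EOFError:
            # Raised once every copy of the sending end has been closed.
            break
        hasher.update(block)
    return hasher.hexdigest()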

I reckon that this will definitely make use of multiple cores, but also introduces some overhead:

multiprocessing.Process(target=read, args=args).start()

But now I'm wondering if just doing this will also use multiple cores, minus the overhead:

read(*args)

Any insights anybody has as to which one would be faster and for what reason would be much appreciated!


Solution

Okay, as came out in the comments, the actual question is:

Does (C)Python create threads on its own, and if so, how can I make use of that?

Short answer: No.

But the reason these C functions are nevertheless interesting for Python programmers is the following. By default, no two snippets of Python code running in the same interpreter can execute in parallel; this is due to the evil called the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which implies the statement above: no two pieces of Python code can run in parallel in the same interpreter.

Nevertheless, you can still make good use of multithreading in Python, namely when you're doing a lot of I/O or making heavy use of external libraries like numpy, scipy, lxml and so on, which all know about the issue and release the GIL whenever they can (i.e. whenever they do not need to interact with the Python interpreter).
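
For example, the following minimal sketch (not taken from the question; the file names and block size are assumptions) hashes several files concurrently with plain threads. It can genuinely overlap work because both file_.read() and hashlib's update release the GIL for large buffers in CPython:

import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha256_of(path, block_size=1 << 20):
    # file_.read() and hasher.update() both release the GIL for large
    # buffers, so several of these calls can run concurrently in threads.
    hasher = hashlib.sha256()
    with open(path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            hasher.update(block)
    return hasher.hexdigest()

paths = ['a.bin', 'b.bin', 'c.bin']  # assumed example inputs
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = dict(zip(paths, pool.map(sha256_of, paths)))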

I hope that cleared up the issue a bit.

Other tips

I think this is the main part of your question:

The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?

I assume your goal is to read and process the file as fast as possible. File reading is in any case I/O bound, not CPU bound: you cannot process data faster than you can read it, so file I/O clearly limits the performance of your software. You cannot increase the read data rate by using concurrent threads/processes for file reading, and 'low-level' CPython is not doing this either. As long as you read the file in one process or thread (even in CPython, with its GIL, a thread is fine), you will get as much data per unit of time as the storage device can deliver. It is also fine to do the file reading in the main thread, as long as there are no other blocking calls that would actually slow the file reading down.
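
A minimal sketch of that arrangement (the main thread reads, a single worker thread computes the checksum; the helper names, queue size and block size are assumptions of mine, not code from the question):

import hashlib
import queue
import threading

def hash_worker(blocks, result):
    # Consume blocks until the sentinel None arrives, then record the digest.
    hasher = hashlib.sha256()
    while True:
        block = blocks.get()
        if block is None:
            break
        hasher.update(block)
    result['digest'] = hasher.hexdigest()

def read_and_checksum(file_path, block_size=1 << 20):
    blocks = queue.Queue(maxsize=8)  # bounded, so reading never runs far ahead of hashing
    result = {}
    worker = threading.Thread(target=hash_worker, args=(blocks, result))
    worker.start()
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            blocks.put(block)
    blocks.put(None)  # sentinel: no more data
    worker.join()
    return result['digest']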
