Question

I am playing with concurrent.futures.ThreadPoolExecutor to see if I can squeeze more work out of my quad-core processor (with 8 logical cores). So I wrote the following code:

from concurrent import futures

def square(n):
    return n**2

def threadWorker(t):
    n, d = t
    if n not in d:
        d[n] = square(n)

def master(n, numthreads):
    d = {}
    with futures.ThreadPoolExecutor(max_workers=numthreads) as e:
        for i in e.map(threadWorker, ((i, d) for i in range(n))):
            pass  # done so that it actually fetches each result. threadWorker has its own side-effects on d
    return len(d)

if __name__ == "__main__":
    print('starting')
    print(master(10**6, 6))
    print('done')

The interesting thing is that the same functionality, written as a plain for-loop, takes about a second:

>>> d = {}
>>> for i in range(10**6):
...     if i not in d: d[i] = i**2

... while the threadpool code takes well over 10 seconds. I know it is using at least 4 threads because I can see the load on each of my cores. But even with shared memory (I can understand why processes might take a while, due to memory copying), this disparity in runtime seems far too large.

Does anyone have any ideas as to why this might take so long? A simple squaring operation, which is highly parallelizable, really should not take this long. Could it perhaps be due to populating the dictionary (and if so, what is causing the slowdown there)?

Technical details:

  • Python 3.3.3
  • quad-core CPU (8 logical cores with hyperthreading)
  • Mac OS X 10.9.1 (Mavericks)

Solution 2

I've not yet tried futures, but I believe it's thread-based, so this probably applies: http://www.youtube.com/watch?v=ph374fJqFPE

In short, I/O-bound workloads thread well in CPython, but CPU-bound workloads do not. And mixing I/O-bound and CPU-bound threads in the same process doesn't thread well either.

If that's the problem, I'd suggest increasing the size of your work chunks (squaring a single number is a tiny piece of work) and using multiprocessing. Multiprocessing is thread-like in its API, but it runs multiple processes with separate memory spaces (data passes between them by pickling), and it tends to give looser coupling between program components than threading anyway.
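
A minimal sketch of that approach (the chunk size and the square_range helper are mine, not from the original post): each task squares a whole block of numbers, and since worker processes do not share the parent's dictionary, each one returns a partial result to be merged in the parent.

from multiprocessing import Pool

def square_range(bounds):
    # Square a whole slice of numbers per task so the real work
    # dwarfs the cost of dispatching the task to a worker process.
    lo, hi = bounds
    return {n: n**2 for n in range(lo, hi)}

def master(n, numprocs, chunksize=100000):
    chunks = [(lo, min(lo + chunksize, n)) for lo in range(0, n, chunksize)]
    d = {}
    with Pool(processes=numprocs) as pool:
        for partial in pool.map(square_range, chunks):
            d.update(partial)  # merge results in the parent process
    return len(d)

if __name__ == "__main__":
    print(master(10**6, 4))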

That, or switch to Jython or IronPython; these reputedly thread well.

OTHER TIPS

Threads have overhead

Contrary to other answers, I'll claim that the main culprit here isn't the GIL (though that is an issue) but rather the overhead of using threads.

The overhead of spawning and switching between system-level threads is small (less than 1 ms), but it still likely overwhelms the cost of squaring a single integer. Ideally you want to break your computation into larger pieces (perhaps square one million integers per task) when using parallelism of any sort, as in the sketch below.
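
As a rough sketch of that idea applied to the question's code (the chunking scheme and sizes are illustrative, not from the question): give each task a whole slice of the range instead of a single integer. With pure-Python arithmetic the GIL still serializes the chunks, so this mainly cuts overhead; swapping in ProcessPoolExecutor with the same structure is what buys real parallelism.

from concurrent import futures

def square_chunk(lo, hi):
    # One task squares a whole block of integers, so the per-task
    # overhead is paid once per block rather than once per number.
    return {n: n**2 for n in range(lo, hi)}

def master(n, numworkers, chunksize=250000):
    d = {}
    with futures.ThreadPoolExecutor(max_workers=numworkers) as e:
        jobs = [e.submit(square_chunk, lo, min(lo + chunksize, n))
                for lo in range(0, n, chunksize)]
        for job in futures.as_completed(jobs):
            d.update(job.result())
    return len(d)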

Bypass the GIL

You can bypass the GIL if you use the numeric Python stack (NumPy/Pandas/C/Fortran/Cython/Numba). For example, the following function squares an array of numbers while releasing the GIL:

import numba
import numpy as np

@numba.jit(nogil=True)
def square(x):
    # Compiled loop; nogil=True releases the GIL while it runs.
    for i in range(len(x)):
        x[i] = x[i] ** 2
    return x

x = np.arange(10**6)  # or np.array(my_list) for an existing list
square(x)

Alternatively, most NumPy operations release the GIL on their own:

x = x**2
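
As a sketch of how this combines with a thread pool (the slicing and sizes are my own illustration, assuming the nogil-compiled square from the block above): because the compiled loop drops the GIL, threads working on disjoint slices of the array can actually run on separate cores.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def threaded_square(x, numworkers=4):
    # np.array_split returns views, so each worker squares its own
    # slice of x in place while square() runs without the GIL.
    views = np.array_split(x, numworkers)
    with ThreadPoolExecutor(max_workers=numworkers) as e:
        list(e.map(square, views))
    return x

x = threaded_square(np.arange(10**6, dtype=np.int64))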

Memory bottleneck

No system will be able to keep multiple cores busy just squaring integers: your CPUs can square integers far faster than your memory hierarchy can deliver them.

You're using threads to try to make CPU-bound work run concurrently? I wouldn't recommend it. Use processes instead; otherwise the GIL will slow things down more and more as the size of your thread pool increases.

[Edit 1]

A similar question, with references to the GIL explanation from David Beazley:

Python code performance decreases with threading

Python has the global interpreter lock (GIL), which prevents Python code in the same process from executing in different threads simultaneously. To achieve true parallel execution you have to use multiple processes (it is easy to switch to ProcessPoolExecutor) or native (non-Python, e.g. C) code.
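
For this particular example, a sketch of that switch might look like the following (the chunksize value is illustrative; note that with processes the shared-dict side effect from the question no longer works, so the worker returns its result instead):

from concurrent import futures

def square(n):
    # Worker processes cannot mutate a dict living in the parent,
    # so each call returns its result for the parent to collect.
    return n, n**2

def master(n, numworkers=4):
    d = {}
    with futures.ProcessPoolExecutor(max_workers=numworkers) as e:
        # chunksize (Python 3.5+) batches many items per inter-process
        # round trip; without it, per-item pickling overhead dominates.
        for k, v in e.map(square, range(n), chunksize=10000):
            d[k] = v
    return len(d)

if __name__ == "__main__":
    print(master(10**6))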

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow