Question

I have a list of numbers. I want to perform some time-consuming operation on each number in the list and make a new list with all the results. Here's a simplified version of what I have:

def calcNum(n):  # some arbitrary, time-consuming calculation on a number
  m = n
  for i in range(5000000):
    m += i % 25
    if m > n * n:
      m //= 2  # floor division keeps m an integer
  return m

nums = [12, 25, 76, 38, 8, 2, 5]
finList = []

for i in nums:
  return_val = calcNum(i)
  finList.append(return_val)

print(finList)

Now, I wanted to take advantage of the multiple cores in my CPU and give each of them one of the numbers to process. Since the "number calculation" function is self-contained from start to finish, I figured this would be fairly simple to do and a perfect situation for multiprocessing/threading.

My question is, which one should I use (multiprocessing or threading?), and what is the simplest way to do this?

I did a test with various code I found in other questions, and while it runs fine, it doesn't seem to do any actual multithreading/multiprocessing: it takes just as long as my first version:

from multiprocessing.pool import ThreadPool

def calcNum(n):  # some arbitrary, time-consuming calculation on a number
  m = n
  for i in range(5000000):
    m += i % 25
    if m > n * n:
      m //= 2  # floor division keeps m an integer
  return m

pool = ThreadPool(processes=3)

nums = [12, 25, 76, 38, 8, 2, 5]
finList = []

for i in nums:
  async_result = pool.apply_async(calcNum, (i,))
  return_val = async_result.get()
  finList.append(return_val)

print(finList)

Solution

multiprocessing.Pool and Pool.map are your best friends here. They save a lot of headache by hiding the queues and synchronization you would otherwise have to wire up yourself: all you need to do is create the pool with a maximum number of processes, then point map at your function and an iterable. See the working code below. (Incidentally, your ThreadPool attempt above also runs serially for a second reason: calling async_result.get() immediately after apply_async blocks until that task finishes, so only one task is ever in flight at a time.)

Because of the join and the way pool.map is designed to be used, the program will wait until ALL processes have returned something before giving you the result.

from multiprocessing import Pool

def calcNum(n):  # some arbitrary, time-consuming calculation on a number
  print("Calcs Started on", n)
  m = n
  for i in range(5000000):
    m += i % 25
    if m > n * n:
      m //= 2  # floor division, so the results stay integers
  return m

if __name__ == "__main__":
  p = Pool(processes=3)

  nums = [12, 25, 76, 38, 8, 2, 5]

  result = p.map(calcNum, nums)
  p.close()
  p.join()

  print(result)

That will get you something like this:

Calcs Started on 12
Calcs Started on 25
Calcs Started on 76
Calcs Started on 38
Calcs Started on 8
Calcs Started on 2
Calcs Started on 5
[72, 562, 5123, 1270, 43, 23, 23]

Regardless of when each process starts or finishes, map waits for every task to complete and then returns all the results in the correct order, corresponding to the order of the input iterable.
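If you would rather consume each result as soon as its worker finishes, instead of waiting for the whole batch, Pool.imap_unordered yields results in completion order. A minimal sketch, assuming the same calcNum definition as in the code above:

from multiprocessing import Pool

if __name__ == "__main__":
  with Pool(processes=3) as p:
    # Results arrive in whatever order the workers finish,
    # not in the order of the input list.
    for result in p.imap_unordered(calcNum, [12, 25, 76, 38, 8, 2, 5]):
      print("Finished one:", result)

Note that you lose the input-to-output correspondence this way; if you need to know which input produced which value, have calcNum return a (n, result) tuple.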

As @Guy mentioned, the GIL hurts us here. You can change Pool to ThreadPool in the code above and see how it affects the timing of the calculations. Because this workload is pure Python bytecode, the GIL only lets one thread execute at a time, so the ThreadPool version still runs very nearly serially. Multiprocessing with a Process or Pool instead launches separate instances of the Python interpreter, which gets around the GIL. If you watch your running processes while the code above runs, you'll see extra instances of python.exe appear while the pool is active; in this case, you'll see a total of 4 (three workers plus the parent).
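A quick way to see this for yourself is to time both pool types on the same workload. A rough sketch, again assuming the calcNum above (time_pool is just a throwaway helper for this comparison):

import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def time_pool(pool_cls, label):
  # Run the same workload through whichever pool class we're handed.
  start = time.time()
  with pool_cls(processes=3) as p:
    p.map(calcNum, [12, 25, 76, 38, 8, 2, 5])
  print(label, "took", round(time.time() - start, 2), "seconds")

if __name__ == "__main__":
  time_pool(ThreadPool, "ThreadPool (threads, GIL-bound)")
  time_pool(Pool, "Pool (separate processes)")

On a multi-core machine the Pool run should finish several times faster, while the ThreadPool run should stay close to the serial timing.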

OTHER TIPS

I guess you are affected by Python's Global Interpreter Lock (GIL):

The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations.

Try using multiprocessing instead:

from multiprocessing import Pool
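
For completeness, here is a minimal runnable version of that suggestion, using the same calcNum from the question and the pool as a context manager so it is cleaned up automatically:

from multiprocessing import Pool

def calcNum(n):  # same arbitrary, time-consuming calculation
  m = n
  for i in range(5000000):
    m += i % 25
    if m > n * n:
      m //= 2
  return m

if __name__ == "__main__":
  with Pool(processes=3) as p:
    print(p.map(calcNum, [12, 25, 76, 38, 8, 2, 5]))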