You are pushing a job out and then immediately waiting for it to complete:

    yield p.apply_async(get_hash, (full_name,)).get()

The AsyncResult.get() method blocks until the job is done, so you are effectively running your jobs sequentially.
Collect the jobs first, poll them with AsyncResult.ready() until they are all done, then .get() the results.
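A minimal sketch of that polling approach; get_hash() here is a stand-in (it hashes the name string itself, not file contents):

```python
import hashlib
import multiprocessing
import time

def get_hash(name):
    # stand-in for the original get_hash(); hashes the name itself
    return hashlib.sha1(name.encode()).hexdigest()

def hasher_polling(names):
    with multiprocessing.Pool(3) as p:
        jobs = [p.apply_async(get_hash, (n,)) for n in names]
        # poll until every AsyncResult reports ready()
        while not all(job.ready() for job in jobs):
            time.sleep(0.01)
        # every job has finished, so no .get() call blocks here
        return [job.get() for job in jobs]
```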
Better still, push all jobs into the pool with .apply_async() calls, then close the pool, call .join() (which blocks until all jobs are done), and finally retrieve the results with .get():
    def hasher_parallel(path=PATH):
        p = multiprocessing.Pool(3)
        jobs = []
        for root, dirs, files in os.walk(path, topdown=False):
            for name in files:
                full_name = os.path.join(root, name)
                jobs.append(p.apply_async(get_hash, (full_name,)))
        p.close()
        p.join()  # wait for jobs to complete
        for job in jobs:
            yield job.get()
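For a self-contained check of this pattern, here's a sketch with an assumed get_hash() that SHA-1s a file's contents, run over a throwaway directory (the filenames are illustrative):

```python
import hashlib
import multiprocessing
import os
import tempfile

def get_hash(full_name):
    # assumed implementation: SHA-1 of the file's contents
    with open(full_name, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def hasher_parallel(path):
    p = multiprocessing.Pool(3)
    jobs = []
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            jobs.append(p.apply_async(get_hash, (os.path.join(root, name),)))
    p.close()
    p.join()  # blocks until every queued job has run
    for job in jobs:
        yield job.get()  # already finished, so this returns immediately

def demo():
    with tempfile.TemporaryDirectory() as path:
        for i in range(3):
            with open(os.path.join(path, "f%d.txt" % i), "w") as f:
                f.write("contents %d" % i)
        return list(hasher_parallel(path))
```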
You could use the Pool.imap() method to simplify the code somewhat; it'll yield results as they become available:
    def hasher_parallel(path=PATH):
        p = multiprocessing.Pool(3)
        filenames = (os.path.join(root, name)
                     for root, dirs, files in os.walk(path, topdown=False)
                     for name in files)
        for result in p.imap(get_hash, filenames):
            yield result
but do experiment with the chunksize parameter, and with the unordered variant Pool.imap_unordered(), as well.
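A sketch of that unordered variant, again with a stand-in get_hash(): Pool.imap_unordered() yields each result as soon as a worker finishes it rather than in submission order, and a larger chunksize batches many small jobs per worker round-trip:

```python
import hashlib
import multiprocessing

def get_hash(name):
    # stand-in for the original get_hash(); hashes the name itself
    return hashlib.sha1(name.encode()).hexdigest()

def hasher_unordered(names, chunksize=16):
    with multiprocessing.Pool(3) as p:
        # results arrive in completion order, not submission order
        for result in p.imap_unordered(get_hash, names, chunksize):
            yield result
```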