Question

The following code doesn't appear to work properly for me. It requires starting a ppserver on another computer on your network, for example with the following command:

ppserver.py -r -a -w 4

Once this server is started, on my machine I run this code:

import pp
import time

# Autodiscover ppservers on the local network; ncpus=0 disables the
# local workers so every job is sent to a remote server.
job_server = pp.Server(ppservers=("*",))
job_server.set_ncpus(0)

def addOneBillion(x):
    r = x
    for i in xrange(10**9):
        r += 1
    f = open('/home/tomb/statusfile.txt', 'a')
    f.write('finished at ' + time.asctime() + ' for job with input ' + str(x) + '\n')
    f.close()
    return r

jobs = []
jobs.append(job_server.submit(addOneBillion, (1,), (), ("time",)))
jobs.append(job_server.submit(addOneBillion, (2,), (), ("time",)))
jobs.append(job_server.submit(addOneBillion, (3,), (), ("time",)))

for job in jobs:
    print job()
print 'done'

The odd part: watching /home/tomb/statusfile.txt, I can see that it is written to several times, as though the function were being run several times. I've watched this continue for over an hour and have never seen a job() return.

Odder: if I change the number of iterations in the function definition to 10**8, the function runs just once and returns a result as expected!

Seems like some kind of race condition? Using only local cores works fine. This is with pp v1.6.0 and v1.5.7.

Update: at around 775,000,000 iterations I get inconsistent results: two of the jobs repeat once, and one finishes on the first try.

Week-later update: I've written my own parallel processing module to get around this and will avoid Parallel Python for now, unless someone figures this out; I'll get around to looking at it more (actually diving into the source code) at some point.

Months-later update: no remaining hard feelings, Parallel Python. I plan to move back as soon as I have time to migrate my application. I've edited the title to reflect the solution.


Solution

Answer from Bagira of the Parallel Python forum:

How long does the calculation of each job take? Have a look at the variable TRANSPORT_SOCKET_TIMEOUT in /usr/local/lib/python2.6/dist-packages/pptransport.py.

Maybe your job takes longer than the timeout that variable allows. Increase its value and try again.

Turns out this was exactly the problem. In my application I'm using PP as a batch scheduler for jobs that can take several minutes, so I needed to adjust this (the default was 30 seconds).
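Rather than editing the installed pptransport.py, the timeout can also be raised from the calling script. This monkey-patch approach is my own sketch, not something the answer describes; it assumes the module is importable as `pptransport` and that pp reads the constant at connect time rather than copying it at import:

```python
import pptransport
import pp

# Raise the transport socket timeout (default 30 s in pp 1.6.0) so that
# jobs taking several minutes are not silently resubmitted. 600 s is an
# arbitrary choice; pick a value longer than your slowest job.
pptransport.TRANSPORT_SOCKET_TIMEOUT = 600

job_server = pp.Server(ppservers=("*",))
```

If the patch has no effect on your pp version, editing the constant directly in the installed pptransport.py, as the answer suggests, is the reliable fallback.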

Other tips

It may be that the library deliberately allows duplicates: when some nodes lag behind, a long tail of remaining tasks builds up, and by duplicating those tasks on other nodes the scheduler can bypass the slow ones and take whichever result finishes first. You can work around the duplicate side effects by tagging each task with a unique id and accepting only the first result returned for each id.
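A minimal sketch of that deduplication idea in plain Python; the helper name and the `(task_id, value)` pair format are my own, not part of the pp API:

```python
def first_result_per_task(results):
    """Keep only the first completed value for each task id.

    `results` is an iterable of (task_id, value) pairs, which may
    contain duplicates when a slow job was re-executed elsewhere.
    """
    first = {}
    for task_id, value in results:
        if task_id not in first:
            first[task_id] = value
    return first

# The duplicate completion of task 1 is ignored after the first.
completions = [(1, 1000000001), (2, 1000000002), (1, 1000000001), (3, 1000000003)]
print(first_result_per_task(completions))
# → {1: 1000000001, 2: 1000000002, 3: 1000000003}
```

The same filtering works regardless of which side effects (like the status-file writes above) the duplicated runs produce, since only the first returned value per id is kept.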

License: CC-BY-SA with attribution
Not affiliated with StackOverflow