Problem

I am having trouble using a parallel version of map (the ppmap wrapper, implemented by Kirk Strauser).

The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.

If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but no longer using any CPU).

e.g.

/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null

Furthermore, the workers do not seem to freeze on any particular data entry - if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and restarting the program multiple times, roughly as sketched below.)
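For illustration, a rough sketch of that restart workaround; the finished_entries.txt file name and its one-id-per-line format are hypothetical, and todo_list, organism_ids and filenames are the input lists used further down:

# Hypothetical sketch of the restart workaround: skip organisms whose ids
# were already recorded in a progress file by a previous run.
done_ids = set()
try:
    with open("finished_entries.txt") as progress:  # hypothetical file name
        # assuming organism ids are stored as strings, one per line
        done_ids = set(line.strip() for line in progress)
except IOError:
    pass  # first run, nothing finished yet

remaining = [(organism, organism_id, filename)
             for (organism, organism_id, filename)
             in zip(todo_list, organism_ids, filenames)
             if organism_id not in done_ids]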

Is there any way to see where the problem is?

A sample of the code I am running:

def analyse_repeats(data):
    """
    Loads a whole proteome into memory and then looks for repeats in its
    sequences, flagging both real repeats and sequences that do not
    contain a particular amino acid.
    """
    (organism, organism_id, filename) = data

    # Imports are kept inside the function so that the pp workers, which
    # run the function in separate processes, have them available.
    import re
    import Bio.SeqIO

    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    # Pre-compile one pattern per amino acid; e.g. "(C+)" matches runs of C
    patterns = [(letter, re.compile("(%s+)" % letter)) for letter in letters]

    try:
        handle = open(filename)
        records = list(Bio.SeqIO.parse(handle, "fasta"))
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for (letter, pattern) in patterns:
                items = set(pattern.findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the sequence, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])


res_generator = ppmap.ppmap(
    None, 
    analyse_repeats, 
    zip(todo_list, organism_ids, filenames)
)

for res in res_generator:
    # process the output
    pass

If I use the plain built-in map instead of ppmap, everything works fine:

res_generator = map(
    analyse_repeats, 
    zip(todo_list, organism_ids, filenames)
)

Solution

You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.

By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
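For example, here is a minimal sketch of the suggested swap, reusing analyse_repeats and the input lists from the question (untested against the original data):

from multiprocessing import Pool

if __name__ == '__main__':
    # One worker per CPU core by default; use Pool(processes=N) for more
    pool = Pool()
    # Pool.map blocks until all calls have finished and returns a list of results
    results = pool.map(analyse_repeats,
                       zip(todo_list, organism_ids, filenames))
    pool.close()
    pool.join()
    for res in results:
        # process the output as before
        pass

Note that analyse_repeats must be defined at module top level so the workers can pickle it, which is already the case in the question's code.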

Other tips

May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but I hope it helps you.
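As a rough sketch, the question's workload could be expressed with dispy's documented JobCluster/submit API like this (assuming analyse_repeats and the input lists from the question):

import dispy

# Distribute analyse_repeats to the available nodes; since its imports
# live inside the function body, the remote workers can run it as-is.
cluster = dispy.JobCluster(analyse_repeats)
jobs = [cluster.submit(item)
        for item in zip(todo_list, organism_ids, filenames)]
for job in jobs:
    # Calling the job blocks until it finishes and returns its result
    organism, store_records = job()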
