Question

I have a simple set of code that runs Clustal Omega (a protein multiple sequence alignment program) from Python:

from Bio.Align.Applications import ClustalOmegaCommandline

segments = list(range(1, 9))
segments.reverse()

for segment in segments:
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment

    cline = ClustalOmegaCommandline(infile=in_file, 
                                    outfile=out_file, 
                                    distmat_out=distmat, 
                                    distmat_full=True, 
                                    verbose=True,
                                    force=True)
    print(cline)
    cline()

I've done some informal tests timing how long my multiple sequence alignments (MSAs) take. On average, each one takes 4 hours, so running all 8 one after another took 32 hours in total. That was my original intent in running it as a for loop: I could let it run and not worry about it.

However, I did yet another informal test: I took the output from the printed cline, copied and pasted it into 8 separate terminal windows spread across two computers, and ran the MSAs that way. On average, each one took about 8 hours... but because they were all running at the same time, it took only 8 hours to get all the results.

In some ways, I've discovered parallel processing! :D

But I'm now faced with the dilemma of how to get it running in Python. I've tried looking at the following SO posts, but I still cannot seem to wrap my head around how the multiprocessing module works.

List of posts:

Would anybody be kind enough to share how they would parallelize this loop? Many of my loops look similar to this one: I perform some action on a file and write to another file, without ever needing to aggregate the results in memory. The specific difference here is the need to do file I/O, rather than aggregating results from parallel runs of the loop.


Solution

Possibly the Joblib library is what you are looking for.

Let me give you an example of its use:

import time
from joblib import Parallel, delayed


def long_function():
    time.sleep(1)


REPETITIONS = 4
Parallel(n_jobs=REPETITIONS)(
    delayed(long_function)() for _ in range(REPETITIONS))

This code runs in 1 second, instead of 4 seconds.
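Note also that Parallel collects the return values of the delayed calls into an ordinary list, in the same order as the input iterable, which is handy when each worker produces a result. A minimal illustration, using math.sqrt as a stand-in task:

```python
import math
from joblib import Parallel, delayed

# Parallel returns the workers' results as a plain list,
# preserving the order of the input generator.
results = Parallel(n_jobs=2)(delayed(math.sqrt)(i ** 2) for i in range(4))
print(results)  # [0.0, 1.0, 2.0, 3.0]
```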

Adapting your code looks like this (sorry, I can't test if this is correct):

from joblib import Parallel, delayed

from Bio.Align.Applications import ClustalOmegaCommandline


def run(segment):
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print(cline)
    cline()


if __name__ == "__main__":
    segments = list(range(1, 9))
    segments.reverse()

    Parallel(n_jobs=len(segments))(
        delayed(run)(segment) for segment in segments)

Other tips

Instead of for segment in segments, write def f(segment) and then use multiprocessing.Pool().map(f, segments)

Figuring out how to put this in context is left as an exercise to the reader.
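To make that slightly more concrete, here is a minimal, self-contained sketch of the Pool().map pattern, with a toy f standing in for the per-segment work (in your case, f would build and call the ClustalOmegaCommandline as in the original loop body):

```python
import multiprocessing


def f(segment):
    # Stand-in for the real per-segment work: in your code this
    # would construct and invoke the ClustalOmegaCommandline.
    return segment * segment


if __name__ == "__main__":
    segments = list(range(1, 9))
    with multiprocessing.Pool() as pool:
        # map distributes the segments across worker processes
        # and blocks until all of them have finished.
        results = pool.map(f, segments)
    print(results)  # [1, 4, 9, 16, 25, 36, 49, 64]
```

Since each iteration only reads one file and writes another, the workers don't need to share any state, which is exactly the situation Pool().map handles well.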

License: CC-BY-SA with attribution
Not affiliated with StackOverflow