Question

I'm building a small pipeline for chewing through a large amount of data, and I've decided to use Python to call the external program on multiple cores.

So here are my questions:

1) The program outputs a very big text file. I only want to save the output to a new file (rather than keep the string as a Python object). What's the best way to do this with the subprocess module?

2) I want to call the program many times in parallel using the multiprocessing module. I normally just take the simple route and use the Pool.map function; will this interfere with the subprocess module?

Thanks in advance!


Solution

1) The program outputs a very big text file. I only want to save the output to a new file (rather than keep the string as a Python object). What's the best way to do this with the subprocess module?

If you look at the documentation, valid values for stdout are:

PIPE, an existing file descriptor (a positive integer), an existing file object, and None.

So:

import subprocess

with open('new_file.txt', 'w') as outfile:
    subprocess.call(['program', 'arg'], stdout=outfile)  # output goes straight to the file, never through Python
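
If you're on Python 3.5 or later, subprocess.run does the same thing. Here's a minimal sketch (the 'program' and 'arg' names are just placeholders for your actual command), with check=True so a non-zero exit raises an error and stderr folded into the same file:

import subprocess

with open('new_file.txt', 'w') as outfile:
    # run() waits for the program; check=True raises CalledProcessError on a non-zero exit,
    # and stderr=subprocess.STDOUT sends the program's error output into the same file.
    subprocess.run(['program', 'arg'], stdout=outfile, stderr=subprocess.STDOUT, check=True)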

2) I want to call the program many times in parallel using the multiprocessing module. I normally just take the simple route and use the Pool.map function; will this interfere with the subprocess module?

Not unless you do certain odd things.

multiprocessing.Pool keeps track of which processes it created, and won't try to manage other child processes that happen to get created elsewhere, so the obvious thing you're worried about isn't an issue.

The most common problem I've seen is using Popen to create child processes that you never reap (that is, never wait() on). You can often get away with that in an app that doesn't use multiprocessing, but as soon as you do the Popen-and-leak inside a pool task, you stop getting away with it. (This isn't really about multiprocessing or Python; it's just that grandchild processes aren't the same as child processes.)
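
To tie the two answers together, here's a minimal sketch of one way to combine Pool.map with subprocess; the program name, its arguments, and the file names are placeholders for whatever your pipeline actually runs:

import subprocess
from multiprocessing import Pool

def run_one(job):
    infile, outpath = job
    with open(outpath, 'w') as outfile:
        # call() blocks until the child exits, so each pool worker reaps
        # its own grandchild process and never leaks it.
        return subprocess.call(['program', infile], stdout=outfile)

if __name__ == '__main__':
    jobs = [('input%d.dat' % i, 'output%d.txt' % i) for i in range(8)]
    with Pool() as pool:                      # one worker per core by default
        exit_codes = pool.map(run_one, jobs)
    print(exit_codes)                         # non-zero entries mean a run failed

Because each task writes to its own file and subprocess.call waits for the program to finish, the pool's own bookkeeping and the external processes never get in each other's way.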

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow