Question

I have a question about the performance scalability of xargs. Currently I have a batch-processing program written in Python with multiprocessing and subprocess: each process spawns an independent subprocess.Popen() to execute an external command. Recently I realized that the whole thing could be redone with xargs, but I wonder whether it is a good idea to use xargs to process 10k+ files, since I have never done anything at this scale with command-line tools alone. My tests with small data sets suggest it is actually not a bad idea if all I am doing is batch-running a bunch of commands, since it avoids the overhead imposed by Python's multiprocessing and subprocess machinery, but I would like to hear from anyone with more experience combining xargs and Python. More specifically, is there any buffer limit I need to configure so that xargs can consume a large number of inputs? Thanks.


Solution

The xargs program collects arguments from standard input and glues them together to build one long command line. If there are too many arguments to fit on a single command line (the operating system caps the combined length, ARG_MAX on POSIX systems), it builds and executes multiple command lines, as many as needed. This also answers your buffer question: you normally do not need to configure anything, because xargs sizes the batches itself.
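You can watch this batching happen. The sketch below (a minimal Python demonstration; the file names are made up) forces tiny batches with -n 4 so the splitting is visible; without -n, xargs splits only when the system's command-line length limit would be exceeded:

    import subprocess

    # Ten NUL-terminated dummy arguments on xargs's standard input.
    args = b"".join(b"file%d\0" % i for i in range(10))

    # --null reads NUL-separated input; -n 4 caps each command line at
    # four arguments, so echo ends up being executed three times.
    out = subprocess.run(["xargs", "--null", "-n", "4", "echo"],
                         input=args, capture_output=True, check=True)
    print(out.stdout.decode(), end="")

This prints the ten names on three lines, one line per echo invocation.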

This means less overhead for starting up and shutting down processes. How much that helps depends on how long your processes run. If you are starting a CPU-intensive program that runs for half an hour, the process startup time is inconsequential. If you are starting a program that runs quickly but you only run a small number of instances, the savings are again inconsequential. However, if your program is truly trivial and its runtime is minimal, you may well notice a difference.

From your problem description, this sounds like a good candidate: 10K+ items, each with relatively short processing. xargs might well speed things up for you.

However, in my experience, doing any nontrivial work in shell scripts brings the pain. If any of your directory or file names can contain a space, the slightest mistake in quoting your variables will break your script, so you have to test obsessively to make sure it works for all possible inputs. For this reason, I write my nontrivial system scripts in Python.

Therefore, if you already have your program working in Python, IMHO you would be crazy to try to rewrite it as a shell script.

Now, you can still use xargs if you want. Just use subprocess to run xargs and pass all the arguments via standard input. This gets you all of the benefit and none of the pain. You can have Python append a NUL byte (b"\0") to each argument and then invoke xargs --null, which makes it robust against file names that contain spaces or even newlines.
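Here is a minimal sketch of that pattern (mycommand and the file list are placeholders for whatever you actually run):

    import subprocess

    # Placeholder input; in practice this might come from os.listdir()
    # or glob.glob().
    files = ["a.txt", "b c.txt", "weird\nname.txt"]

    # Terminate each argument with a NUL byte so that spaces and even
    # newlines in file names pass through intact.
    payload = b"".join(name.encode() + b"\0" for name in files)

    # xargs --null splits its standard input on NUL bytes and appends
    # the pieces to mycommand, re-running it as often as needed.
    subprocess.run(["xargs", "--null", "mycommand"],
                   input=payload, check=True)

With check=True, subprocess.run raises CalledProcessError if xargs (or the command it ran) exits with a nonzero status, so failures do not pass silently.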

Alternatively, you could use ' '.join() to build your own very long command lines, but I see no reason to do that when you can simply run xargs as described above.
