Question

I have a question about the performance scalability of xargs. Currently I have a batch-processing program written in Python with multiprocessing and subprocess: each process spawns an independent subprocess.Popen() to execute an external command. Recently I realized that the whole thing could be redone with xargs, but I wonder whether it is a good idea to use xargs to process 10k+ files, since I have never done anything at this scale with command-line tools alone. My tests with small data sets suggest it is actually not a bad idea if all I am doing is batch-running a bunch of commands, since it avoids the overhead imposed by Python's multiprocessing and subprocess machinery, but I would like to hear from anyone with more experience combining xargs and Python. More specifically, is there any buffer limit I need to configure so that xargs can consume a large number of inputs? Thanks.


Solution

The xargs program collects arguments from standard input and glues them together to build one long command line. If there are too many arguments to fit on a single command line (the operating system caps the combined length, ARG_MAX on POSIX systems), it builds and executes multiple command lines, as many as needed. This also answers your buffer question: you normally do not need to configure anything, because xargs sizes the batches itself.
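You can watch this batching happen. The sketch below (a minimal Python demonstration; the file names are made up) forces tiny batches with -n 4 so the splitting is visible; without -n, xargs splits only when the system's command-line length limit would be exceeded:

    import subprocess

    # Ten NUL-terminated dummy arguments on xargs's standard input.
    args = b"".join(b"file%d\0" % i for i in range(10))

    # --null reads NUL-separated input; -n 4 caps each command line at
    # four arguments, so echo ends up being executed three times.
    out = subprocess.run(["xargs", "--null", "-n", "4", "echo"],
                         input=args, capture_output=True, check=True)
    print(out.stdout.decode(), end="")

This prints the ten names on three lines, one line per echo invocation.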

This means less overhead for starting up and shutting down processes. How much that helps depends on how long your processes run. If you are starting a CPU-intensive program that runs for half an hour, the process startup time is inconsequential. If you are starting a program that runs quickly but you only run a small number of instances, the savings are again inconsequential. However, if your program is truly trivial and its runtime is minimal, you may well notice a difference.

From your problem description, this sounds like a good candidate: 10K+ items, each with relatively short processing. xargs might well speed things up for you.

However, in my experience, doing any nontrivial work in shell scripts brings the pain. If any of your directory or file names can contain a space, the slightest mistake in quoting your variables will break your script, so you have to test obsessively to make sure it works for all possible inputs. For this reason, I write my nontrivial system scripts in Python.

Therefore, if you already have your program working in Python, IMHO you would be crazy to try to rewrite it as a shell script.

Now, you can still use xargs if you want. Just use subprocess to run xargs and pass all the arguments via standard input. This gets you all of the benefit and none of the pain. You can have Python append a NUL byte (b"\0") to each argument and then invoke xargs --null, which makes it robust against file names that contain spaces or even newlines.
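Here is a minimal sketch of that pattern (mycommand and the file list are placeholders for whatever you actually run):

    import subprocess

    # Placeholder input; in practice this might come from os.listdir()
    # or glob.glob().
    files = ["a.txt", "b c.txt", "weird\nname.txt"]

    # Terminate each argument with a NUL byte so that spaces and even
    # newlines in file names pass through intact.
    payload = b"".join(name.encode() + b"\0" for name in files)

    # xargs --null splits its standard input on NUL bytes and appends
    # the pieces to mycommand, re-running it as often as needed.
    subprocess.run(["xargs", "--null", "mycommand"],
                   input=payload, check=True)

With check=True, subprocess.run raises CalledProcessError if xargs (or the command it ran) exits with a nonzero status, so failures do not pass silently.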

Alternatively, you could use ' '.join() to build your own very long command lines, but I see no reason to do that when you can simply run xargs as described above.
