In SPMD using GNU parallel, is processing the smallest files first the most efficient way?

StackOverflow https://stackoverflow.com/questions/23068039

03-07-2023
Question

This is pretty straightforward:

Say I have many files in the folder data/ to process via some executable ./proc. What is the simplest way to maximize efficiency? I have been doing this to gain some efficiency:

ls --sort=size data/* | tac | parallel ./proc

which lists the files by size, largest first; tac (cat reversed) then flips that order so the smallest files are processed first. Is this the most efficient solution? If not, how can the efficiency be improved (simple solutions preferred)?

I remember that sorting like this leads to better efficiency since larger jobs don't block up the pipeline, but aside from examples I can't find or remember any theory behind this, so any references would be greatly appreciated!


Solution

If you need to run all the jobs and want to minimize the time to complete them all, you want the cores to finish at the same time. In that case you should run the small jobs last. Otherwise you may end up in a situation where every CPU is done except one that has only just started on the last big job, and the capacity of every other CPU is wasted while it runs.
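Concretely, applying this to the command from the question just means dropping tac, so parallel sees the largest files first. A minimal sketch, assuming filenames in data/ contain no newlines:

```shell
# Largest files first: ls -S sorts by size, descending, so the
# long-running jobs start early and the small ones fill in at the end.
ls -S data/* | parallel ./proc
```

The -S flag is the short form of --sort=size, so this is the original pipeline minus the reversal step.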

Here are 8 jobs: 7 take 1 second, one takes 5:

1 2 3 4 55555 6 7 8

On a dual core, small jobs first:

1368
24755555

On a dual core, big jobs first:

555557
123468
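The two schedules above can be checked with a little arithmetic. Below is a sketch of a greedy two-core scheduler that hands each job, in order, to the less-loaded core and reports the finishing time; the function name makespan is introduced here for illustration and is not part of GNU parallel:

```shell
#!/bin/sh
# Greedy two-core scheduler: give each job duration (in seconds) to
# whichever core currently has less work, then print the makespan,
# i.e. the time at which the busier core finishes.
makespan() {
  c1=0; c2=0
  for t in "$@"; do
    if [ "$c1" -le "$c2" ]; then c1=$((c1 + t)); else c2=$((c2 + t)); fi
  done
  if [ "$c1" -ge "$c2" ]; then echo "$c1"; else echo "$c2"; fi
}

makespan 1 1 1 1 1 1 1 5   # small jobs first: prints 8
makespan 5 1 1 1 1 1 1 1   # big job first:    prints 6
```

This reproduces the diagrams: smallest-first finishes at second 8, while big-first finishes at second 6.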
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow