Question

I've been using GNU parallel for a while, mostly to grep large files or run the same command for various arguments when each command/arg instance is slow and needs to be spread out across cores/hosts.

One thing that would also be great to spread across multiple cores and hosts is finding a file in a large directory subtree. For example, something like this:

find /some/path -name 'regex'

will take a very long time if /some/path contains many files and many nested directories that themselves contain many files. I'm not sure whether this is as easy to speed up. For example:

ls -R -1 /some/path | parallel --profile manyhosts --pipe egrep regex

Something like that comes to mind, but ls would be very slow at producing the list of files to search. What's a good way, then, to speed up such a find?


Solution

If /some/path has a few hundred immediate subdirectories, you can use:

 parallel --gnu -n 10 find {} -name 'regex' ::: *

to run find over them in parallel, handing each find invocation ten subdirectories at a time (by default, parallel starts one job per CPU core).
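
If you want a bit more control, the same idea can be written with an explicit job count and with the arguments restricted to directories. A minimal variant (the -j value of 8 is an arbitrary placeholder, and */ expands to the immediate subdirectories only):

 # run from /some/path; */ matches only the immediate subdirectories
 # -n 10 hands each find invocation up to ten of them ({} expands to all ten),
 # -j 8 caps the number of concurrent find processes (default: one per CPU core)
 cd /some/path
 parallel --gnu -j 8 -n 10 find {} -name 'regex' ::: */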

Note, however, that walking a directory tree like this is an I/O-bound task, so the speedup you can get depends on the backing storage. On a spinning hard disk, running several find processes at once will probably just be slower, since they compete for seeks (and if you benchmark, beware of the disk cache skewing repeat runs).
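
If the tree lives on storage that every host mounts at the same path (say, over NFS), the manyhosts profile from the question can in principle spread the same work across machines too. A rough sketch under that assumption, using absolute paths so the remote find processes look at the same directories:

 # assumes /some/path is visible at the same location on every host in the profile
 parallel --gnu --profile manyhosts -n 10 find {} -name 'regex' ::: /some/path/*/

Whether that actually helps depends on the filesystem: if every host ends up hammering the same file server for metadata, the bottleneck just moves there.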

Licensed under: CC-BY-SA with attribution