Question

The following takes 24 minutes to run on a file with 47,000 entries (8 cores, Windows 7, running Cygwin):

cat File_Path.txt | parallel --progress --tag -j +0 'pdftotext {} 2>/dev/null - | wc -w;' > results.txt

I am taking each line in File_Path.txt (each line is a path to a PDF), converting that PDF to text, and counting the words in the result. Is there any way I can shave some time off the processing?


Solution

I ran your script on 148 random PDF files. That took 41 CPU seconds, i.e. 0.27 CPU seconds per file. So a rough guesstimate of what you should expect is on the order of 1700 seconds on an 8-core machine, which matches what you see. I will therefore assume you see 100% CPU utilization during the 24 minutes.
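The estimate works out like this (integer shell arithmetic is close enough here):

```shell
# 41 CPU seconds for 148 sample files, scaled to 47,000 files on 8 cores
echo $(( 47000 * 41 / 148 / 8 ))   # about 1630 seconds, i.e. roughly 27 minutes
```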

GNU Parallel spends less than 0.01 seconds per job, so most of the time is spent running pdftotext. You will therefore gain the most by using a faster tool than pdftotext. Unfortunately, I do not know of one.
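Minor shavings are possible on the invocation itself, though they will not change the order of magnitude, since pdftotext dominates. A sketch, assuming the Xpdf/Poppler pdftotext (whose -q flag silences error messages):

```shell
# Same pipeline without the extra cat process; -q replaces 2>/dev/null
# (assumes the Xpdf/Poppler pdftotext, which supports -q)
parallel --progress --tag -j+0 'pdftotext -q {} - | wc -w' < File_Path.txt > results.txt
```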

--edit--

If you have ssh access to other servers, you can use their CPUs, too. See the tutorial http://www.gnu.org/software/parallel/parallel_tutorial.html#remote_execution on how to do that.
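A minimal sketch of what that might look like, assuming hypothetical hosts server1 and server2 that you can ssh into without a password and that have pdftotext installed (':' tells GNU Parallel to also use the local machine):

```shell
# --transferfile copies each PDF to the remote host before its job runs,
# and --cleanup removes it afterwards (flag names as in recent GNU Parallel;
# older versions spell the first one --transfer)
parallel -S :,server1,server2 --transferfile {} --cleanup --tag \
    'pdftotext {} - 2>/dev/null | wc -w' < File_Path.txt > results.txt
```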

Licensed under: CC-BY-SA with attribution