The following takes 24 minutes to run on a file with 47,000 entries (8 cores, Windows 7, running Cygwin):

cat File_Path.txt | parallel --progress --tag -j +0 'pdftotext {} - 2>/dev/null | wc -w' > results.txt

For each line in File_Path.txt, I convert the PDF to text and count the words in the output. Is there any way I can shave some time off the processing?


Solution

I ran your command on 148 random PDF files. That took 41 CPU seconds, i.e. about 0.28 CPU seconds per file. A rough guesstimate for 47,000 files is therefore on the order of 1700 seconds of wall-clock time on an 8-core machine, which matches what you see. So I will assume you see 100% CPU utilization during the 24 minutes.
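A quick way to sanity-check that estimate on your own machine (the sample directory below is hypothetical; any handful of representative PDFs will do):

```shell
# Run the same pipeline on a small sample and time it.
# ~/sample_pdfs is a hypothetical directory containing test PDFs.
cd ~/sample_pdfs
time parallel -j +0 'pdftotext {} - 2>/dev/null | wc -w' ::: *.pdf > /dev/null
# (user + sys) divided by the number of files gives CPU seconds per file.
# Scaled up: 47000 files * 0.277 CPU s/file / 8 cores ~ 1627 s of wall time.
```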

GNU Parallel spends less than 0.01 second per job, so almost all of the time is spent inside pdftotext. You will therefore gain the most by using a faster tool than pdftotext. Unfortunately, I do not know of such a tool.
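The overhead claim is easy to verify yourself: time GNU Parallel running a no-op command and divide the wall time by the job count (the figure of 1000 jobs is arbitrary):

```shell
# Measure GNU Parallel's per-job overhead with a no-op command.
# Total wall time divided by 1000 approximates the startup cost per job.
time parallel -j 1 true ::: $(seq 1000)
```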

--edit--

If you have ssh access to other servers, you can use their CPUs, too. See the tutorial http://www.gnu.org/software/parallel/parallel_tutorial.html#remote_execution on how to do that.
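A minimal sketch of remote execution, assuming passwordless ssh to two machines (the host names server1 and server2 are placeholders) with pdftotext installed on each; --transferfile copies each PDF to the remote host before its job runs, and the trailing ':' in the host list keeps the local machine in the pool:

```shell
# Spread the jobs across two remote hosts plus the local machine.
# server1 and server2 are hypothetical; replace with real ssh logins.
parallel -S server1,server2,: --transferfile {} --progress --tag -j +0 \
    'pdftotext {} - 2>/dev/null | wc -w' < File_Path.txt > results.txt
```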
