The job is quite simple: I have a few hundred PDF documents, and I need to export each page of them as two images: one big, one small.
After a couple of hours of research and optimization I came up with a neat Bash script to do it:
#!/bin/bash
FILE=$1

# Name the output directory after the file's MD5 hash.
# Note: md5 -q is macOS-specific; on Ubuntu use: SLUG=$(md5sum "$FILE" | cut -d' ' -f1)
SLUG=$(md5 -q "$FILE")
mkdir -p "$SLUG"

# Render every page to a 1920x2278 JPEG at 216 dpi.
gs -sDEVICE=jpeg -r216 -g1920x2278 -q -o "$SLUG/%d.jpg" "$FILE"

# Downscale each JPEG to a 171x219 PNG thumbnail.
for IMAGE in "$SLUG"/*.jpg; do
    convert "$IMAGE" -resize 171x219 "${IMAGE%.jpg}.png"
done
As you can see, I...
- Create a directory named after the MD5 hash of the file
- Use Ghostscript to extract each page of the PDF into a big JPEG
- Use ImageMagick to create a smaller PNG version of each JPEG
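One optimization I may try: skip convert entirely and have Ghostscript render the small PNGs itself in a second pass. This is just a sketch under my own assumptions; the -r value is a back-of-the-envelope guess (216 dpi gave a 1920px-wide page, so 216 × 171 / 1920 ≈ 19 dpi should land near 171px), so the output dimensions would need checking:

```shell
# Hypothetical second pass: render pages straight to small PNGs with Ghostscript.
# -r19 is a guess derived from 216 * 171 / 1920; verify the actual output size.
gs -sDEVICE=png16m -r19 -g171x219 -q -o "$SLUG/%d.png" "$FILE"
```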
It works. But I'm afraid it's not fast enough.
I'm getting an average of 0.6s per page (roughly 1 minute for an 80-page PDF) on my MacBook. But the script is going to run on a server, a much lower-end one: probably a micro EC2 instance on Amazon running Ubuntu.
Anyone got any tips, tricks, or leads to help me optimize this script? Should I use another tool? Are there better-suited libraries for this kind of work?
Unfortunately I don't write C or C++, but if you point me to some good libraries and tutorials I'll gladly learn.
Thanks.
Update.
I just tested it on a t1.micro instance on AWS. It took 10 minutes to process the same 80-page PDF. I also noticed that convert was the slowest step, taking almost 5 minutes to resize the images.
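Since convert is the bottleneck and each page is independent, one idea I'm considering is running the resizes in parallel with xargs -P. A sketch (the worker count of 4 is arbitrary and should match the instance's cores):

```shell
# Resize all page JPEGs in parallel, up to 4 convert processes at once.
# Each file is handed to a small sh wrapper as $0 so the extension can be rewritten.
printf '%s\0' "$SLUG"/*.jpg |
  xargs -0 -n 1 -P 4 sh -c 'convert "$0" -resize 171x219 "${0%.jpg}.png"'
```

On a single-core t1.micro this probably won't help much, but on anything with multiple cores it should cut the resize phase roughly in proportion to the core count.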
Update 2.
I tested it now on a c1.medium instance. It costs ~7x as much as a t1.micro, but it came very close to my MacBook's performance: ~3.5 minutes for a 244-page document.
I'm going to try mudraw and other combinations now.
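For reference, this is the kind of mudraw invocation I plan to try (MuPDF's rasterizer). The flags are from memory, so they should be double-checked against the installed version; recent MuPDF releases ship this as mutool draw:

```shell
# Sketch: render each page of the PDF to a PNG at 216 dpi with MuPDF's mudraw.
# %d in the output template is replaced with the page number, as in the gs command.
mudraw -r 216 -o "$SLUG/%d.png" "$FILE"
```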