The job is quite simple: I have a few hundred PDF documents, and I need to export each page of them as two images: one big, one small.
After a couple of hours of research and optimization I came up with a neat Bash script to do it:
#!/bin/bash
FILE=$1

# Name the output directory after the file's MD5 hash.
# Note: md5 -q is macOS-specific; on Ubuntu use: SLUG=$(md5sum "$FILE" | cut -d' ' -f1)
SLUG=$(md5 -q "$FILE")
mkdir -p "$SLUG"

# Render every page to a 1920x2278 JPEG at 216 dpi.
gs -sDEVICE=jpeg -r216 -g1920x2278 -q -o "$SLUG/%d.jpg" "$FILE"

# Downscale each JPEG to a 171x219 PNG thumbnail.
for IMAGE in "$SLUG"/*.jpg; do
    convert "$IMAGE" -resize 171x219 "${IMAGE%.jpg}.png"
done
As you can see, I...
- Create a directory named after the MD5 hash of the file
- Use Ghostscript to extract each page of the PDF into a big JPEG
- Use ImageMagick to create a smaller PNG version of each JPEG
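One optimization I may try: skip convert entirely and have Ghostscript render the small PNGs itself in a second pass. This is just a sketch under my own assumptions; the -r value is a back-of-the-envelope guess (216 dpi gave a 1920px-wide page, so 216 × 171 / 1920 ≈ 19 dpi should land near 171px), so the output dimensions would need checking:

```shell
# Hypothetical second pass: render pages straight to small PNGs with Ghostscript.
# -r19 is a guess derived from 216 * 171 / 1920; verify the actual output size.
gs -sDEVICE=png16m -r19 -g171x219 -q -o "$SLUG/%d.png" "$FILE"
```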
It works. But I'm afraid it's not fast enough.
I'm getting an average of 0.6s per page (roughly 1 minute for an 80-page PDF) on my MacBook. But the script is going to run on a server, a much lower-end one: probably a micro EC2 instance on Amazon running Ubuntu.
Anyone got any tips, tricks, or leads to help me optimize this script? Should I use another tool? Are there better-suited libraries for this kind of work?
Unfortunately I don't write C or C++, but if you point me to some good libraries and tutorials I'll gladly learn.
Thanks.
Update.
I just tested it on a t1.micro instance on AWS. It took 10 minutes to process the same 80-page PDF. I also noticed that convert was the slowest step, taking almost 5 minutes to resize the images.
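Since convert is the bottleneck and each page is independent, one idea I'm considering is running the resizes in parallel with xargs -P. A sketch (the worker count of 4 is arbitrary and should match the instance's cores):

```shell
# Resize all page JPEGs in parallel, up to 4 convert processes at once.
# Each file is handed to a small sh wrapper as $0 so the extension can be rewritten.
printf '%s\0' "$SLUG"/*.jpg |
  xargs -0 -n 1 -P 4 sh -c 'convert "$0" -resize 171x219 "${0%.jpg}.png"'
```

On a single-core t1.micro this probably won't help much, but on anything with multiple cores it should cut the resize phase roughly in proportion to the core count.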
Update 2.
I tested it now on a c1.medium instance. It costs ~7x as much as a t1.micro, but it came very close to my MacBook's performance: ~3.5 minutes for a 244-page document.
I'm going to try mudraw and other combinations now.
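For reference, this is the kind of mudraw invocation I plan to try (MuPDF's rasterizer). The flags are from memory, so they should be double-checked against the installed version; recent MuPDF releases ship this as mutool draw:

```shell
# Sketch: render each page of the PDF to a PNG at 216 dpi with MuPDF's mudraw.
# %d in the output template is replaced with the page number, as in the gs command.
mudraw -r 216 -o "$SLUG/%d.png" "$FILE"
```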