Question

I use Abbyy FineReader for ScanSnap to OCR a couple of scanned PDF files. The software claims it retains the original PDF images. The PDF file sizes pre-OCR and post-OCR are almost identical, which is good.

After the software is done, all PDF images appear anti-aliased in Acrobat X. Page navigation is much slower than before, and when I zoom in/out, the images first go to what looks like the pre-anti-aliasing version before quickly changing to anti-aliased images.

Left: Scanned PDF / Right: after OCR with Abbyy enter image description here

I would like to get the original images without anti-aliasing back. Interestingly, when I open a single page from the anti-aliased PDF in Photoshop, there is no anti-aliasing and the image looks like the left one.

My limited PDF programming experience leads me to believe that Abbyy likely sets some kind of anti-alias flag for each image during OCR processing. How do I un-set this flag?

Any pointers to useful ideas would be much appreciated.

Was it helpful?

Solution

There is /Interpolate true entry in image dictionary of OCR-ed version, and that's what causes 'anti-aliasing'. Whether that (and not JPEG2000 instead of JPEG compression) is a cause of slow-down, you check on large enough files.

To un-set this key, the best would be to turn it off while creating a file, and if that's not possible, to write and run a small program in suitable language.

But, since your file doesn't sport 'compressed objects' and offending key is in plain view inside a file, in the spirit of 'job done quickly' you can simply process your file e.g. like this:

perl -M-encoding -0777pe "s!/Interpolate true!' 'x17!ge" <in.pdf >out.pdf

OTHER TIPS

After the software is done, all PDF images appear anti-aliased in Acrobat X. Page navigation is much slower than before, and when I zoom in/out, the images first go to what looks like the pre-anti-aliasing version before quickly changing to anti-aliased images.

Actually in the original file 2013_11_15_22_51_31.pdf contains a JPEG image while the OCR'ed file 2013_11_15_22_51_31_OCR.pdf contains a JPEG2000 image.

Comparing them in third party viewers, it becomes clear that the image in the OCR'ed file is not inherently anti-alias'ed. Furthermore there is no evident flag in the PDF instructing PDF viewers to apply anti-aliasing to the JPEG2000 image. Thus, Adobe Reader seems to automatically render JPEG and JPEG2000 images differently, applying anti-aliasing to the latter but not to the former.

Comparing both images in detail, though, it becomes clear that these images are not identical but instead the image in the OCR'ed PDF is slightly rotated.

I assume Abbyy FineReader recognized that the original scanned image is not correctly oriented. Thus, it rotated it slightly to correct this orientation.

Thus, replacing the image in the OCR'ed version with the one from the original one is no option: Due to the rotation the OCR information would partially be somewhat off.

What you might want to try is to recode the JPEG2000 image to JPEG and replace the image in the OCR'ed version with this recoded one. This will mean some loss of quality but most likely you can get rid of the anti-aliasing this way.

Be aware, though, that the JPEG2000 image is slightly larger than the JPEG image to accomodate for the rotation.

PS: As @VadimR pointed out, there is indeed an /Interpolate true entry in the image dictionary of the OCR-ed version I missed when looking at the file. This does not seem to be the major issue slowing down the rendering.

The original JPEG

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top