Question

I recently bought an Epson scanner so I can start digitizing a mountain of documents I've accumulated over the years. I've already learned how to scan documents into PDFs. However, I want to make sure my PDFs have searchable text. I think the technical term is OCR, but I'm thoroughly confused.

I can scan files into PDFs using my scanner alone. But if I understand correctly, I can't make them OCR searchable unless I make Adobe Acrobat and/or ABBYY FineReader part of the workflow. (I'm using a Mac running Mavericks, by the way.)

I guess the first thing I need to ask is this: What software do I need for creating a PDF that's OCR searchable? Like I said, I already have the Epson scanner software installed, but it looks like I also need Acrobat and/or ABBYY FineReader.

I guess a second question I should ask is how do I know if a PDF has searchable text? Could I simply search for a word or phrase on a PDF page with a standard program like Dreamweaver or Apple's Spotlight? Thanks.

Solution

The scanner produces an image and saves it either in an image format or as a PDF. You then open the result in OCR software, such as ABBYY FineReader. You can also open it in Acrobat, which has OCR components built in. Once Acrobat has run OCR on the file, you have a searchable document, unless Acrobat was unable to recognize any readable characters. Other OCR software may save a PDF or another file format.

Another product has been mentioned in another answer; I don't know it, but it might be worthwhile having a look at it.

For the second question:

a) There is an Acrobat JavaScript Doc object method, getPageNumWords(); if this method returns a number greater than 0, the page you passed as the argument has searchable text. You can find more information about this method in the Acrobat JavaScript documentation, which is part of the Acrobat SDK, downloadable from the Adobe website.

b) There is a preflight check that finds out whether the page/document has Text objects. If so, it has searchable text. You will need Acrobat Pro for this, however.
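
If you don't have Acrobat, the same idea (count the words on a page and see whether the count is greater than 0) can be sketched in Python. This is not the Acrobat method from a); it assumes the separate pdftotext command-line tool from Poppler is installed, and the function names are made up for illustration.

```python
import subprocess


def count_words(text: str) -> int:
    """Count whitespace-separated words; page-break characters count as whitespace."""
    return len(text.split())


def page_text(pdf_path: str, page: int) -> str:
    """Return the text layer of one page (1-based) using Poppler's pdftotext.

    Assumption: the `pdftotext` executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def page_is_searchable(pdf_path: str, page: int) -> bool:
    """A page with at least one extractable word has searchable text."""
    return count_words(page_text(pdf_path, page)) > 0


# Example use (hypothetical file name):
#   page_is_searchable("scan.pdf", 1)
```

A scanned page that was never run through OCR contains only an image, so no words are extracted and the check returns False.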

OTHER TIPS

You can scan to a multi-page TIFF image and let Tesseract 3.03 create a searchable PDF for you.
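
A minimal sketch of that step, wrapped in Python so it can be scripted. It assumes the tesseract executable (3.03 or later, which ships the pdf output config) is on the PATH and that English language data is installed; the file names and function names are only examples.

```python
import subprocess
from pathlib import Path


def tesseract_pdf_command(tiff_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <outputbase> -l <lang> pdf"""
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(tiff_path, output_base, lang="eng"):
    """Run Tesseract on a (multi-page) TIFF and return the path of the PDF.

    Tesseract appends ".pdf" to the output base name itself.
    """
    subprocess.run(tesseract_pdf_command(tiff_path, output_base, lang), check=True)
    return Path(str(output_base) + ".pdf")


# Example use (hypothetical file names):
#   make_searchable_pdf("scan.tif", "scan")   # writes scan.pdf
```

The same thing can be typed directly in Terminal as: tesseract scan.tif scan -l eng pdf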

Most solutions are to use the scanner to generate an image file (like a nonsearchable PDF), then to move your body from your scanner over to your computer, log in, run some unwieldy outrageously priced software called ABBSGDS or something, click a ton of menu buttons, respond to a ton of dialogue boxes, twiddle your thumbs as you watch the OCR progress bar, and voila--a searchable PDF.

Or, you can get a Canon scanner (e.g. DR-M160) and use their free CaptureOnTouch software. In that case, you put a document in the scanner, choose a number on the scanner, and press scan. A few seconds later (even on a slow computer) a fully OCR'd searchable PDF will be in the directory programmed to the number you selected. You never even have to touch your computer (although it must be on, of course).

Anything else is, in my opinion, utterly worthless for a busy office environment where you are scanning dozens of multi-page documents per day. I, e.g., stand by my scanner dropping in document after document in rapid succession. I never go to my computer, and all of my documents are searchable PDFs just about as fast as I can drop them in.

If anyone knows of a software solution with that kind of workflow that works with general scanners, please let me know. I just made the mistake of buying a Lexmark multifunction that, since it came with ABBYYwhatever software, is effectively a unifunction.

Licensed under: CC-BY-SA with attribution