I have been extracting text from PDFs using pdftotext, and I have also done this with Ghostscript. Recently, a utility provider changed their PDFs so that a portion of the text is no longer extracted by either method. Specifically, I'm missing the due date and total due. When I open the PDF in a reader, the 'missing' text can be highlighted, copied, and pasted into an external editor. When I open it in Acrobat Pro and view the content (View -> Show/Hide -> Navigation Panes -> Content), the text I need is there. How can I get it out without manually copying and pasting? That is not an option, because I'll be doing this on thousands of PDFs.

Here is an example of what I'm dealing with. I have removed all sensitive data:

link to PDF

EDIT: I noticed after posting this that when you follow the link to the file (hosted on Google Drive), the viewer lets you select and copy most of the text on the page, but not the parts I'm missing. When you download the file, you can select the missing text in a PDF reader.


Solution

I have solved this by getting the newest unreleased version of Ghostscript from git and building it. Now the txtwrite device gives me exactly what I need. Thanks to chrisl for his answer and comments leading me in the right direction.

Other tips

Recent releases of Ghostscript have a txtwrite device which is probably worth trying.
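Since the question involves thousands of files, here is a minimal Python sketch of how the txtwrite device could be run over a directory. The function names and directory layout are hypothetical; the binary name `gs` is an assumption (on Windows it is typically `gswin32c` or `gswin64c`), and whether the output contains the missing fields depends on the Ghostscript version in use.

```python
import subprocess
from pathlib import Path


def txtwrite_command(pdf_path, txt_path, gs_binary="gs"):
    """Build the Ghostscript argument list that writes extracted text to txt_path."""
    return [
        gs_binary,
        "-q",                        # suppress the startup banner
        "-dNOPAUSE",                 # do not pause between pages
        "-dBATCH",                   # exit once the input has been processed
        "-sDEVICE=txtwrite",         # the text extraction device
        f"-sOutputFile={txt_path}",  # where the extracted text goes
        str(pdf_path),
    ]


def extract_directory(pdf_dir, out_dir, gs_binary="gs"):
    """Run txtwrite over every *.pdf in pdf_dir and return the files that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".txt")
        result = subprocess.run(txtwrite_command(pdf, target, gs_binary))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `extract_directory("bills", "text")` (both directory names are placeholders) would leave one `.txt` file per PDF, which can then be searched for the due date and total due.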

There is a VERY HACKY method to extract the data, but it only works with older versions of Ghostscript, such as 8.51 or 8.62. In those versions, the PDF operators are defined in /lib/pdf_ops.ps. Newer versions do something else.

A tested copy of version 8.62 is available here.

http://sourceforge.net/projects/ghostscript/files/GPL%20Ghostscript/8.62/gs862w32.exe/download

The text you are after is drawn by the operators defined as /Tj {} def and /TJ {} def. Adding a dup == to the beginning of each definition prints the operand to standard output before it is shown. (This could be made more sophisticated.) I also didn't bother with the font warning messages, but these could be filtered out if the data were written to a file.

Some words are split into pieces and individual letters because kerning is being done. Given time, this could also be filtered.
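As a rough sketch of that filtering, the Python function below joins the pieces of one TJ operand as printed by `==`. It assumes the operand appears on a single line in the form `[(Acc) -15 (ount) -250 (Number)]` and that parentheses inside strings are backslash-escaped; the function name and the `gap_threshold` value are hypothetical and would need tuning against real output. Octal escapes are left untouched.

```python
import re

# A token is either a PostScript string literal (backslash escapes allowed,
# no unescaped nested parentheses) or a number.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[-+]?\d*\.?\d+")


def join_tj_array(line, gap_threshold=200.0):
    """Join the string pieces of one TJ operand printed by `==`.

    Numbers in a TJ array are in thousandths of a text-space unit and are
    subtracted from the current position, so a large negative value moves
    the pen forward and probably marks a word gap.
    """
    pieces = []
    for match in TOKEN.finditer(line):
        token = match.group(0)
        if token.startswith("("):
            # Drop the delimiters and undo the simplest escapes: \( \) \\
            pieces.append(re.sub(r"\\([()\\])", r"\1", token[1:-1]))
        elif float(token) <= -gap_threshold:
            pieces.append(" ")
    return "".join(pieces)
```

Small kerning adjustments are dropped and only large forward moves become spaces, so split words are glued back together while word gaps survive.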

modified /Tj from pdf_ops.ps

/Tj { dup == 0 0 moveto Show settextposition } bdef

modified /TJ from pdf_ops.ps

/TJ { dup == 
  0 0 moveto {
    dup type /stringtype eq {
      Show
    } { -1000 div
      currentfont /ScaleMatrix .knownget { 0 get mul } if
      0 Vexch rmoveto
    } ifelse
  } forall settextposition
} bdef

output

(Help a neighbor within your county each month by contributing to The Salvation )
(Army's Project SHARE and Georgia Power will match your gift. To help, simply check )
($1, $2, $5, or $10 on the return portion of this bill. Starting next month, your pledge )
(amount will be included on your monthly bill.)
(Our business offices will be closed on December 24 and 25 for Christmas and January )
(1 for New Year's Day. In case of an emergency, please call us at the number on your )
(bill 24 hours a day, 7 days a week.)
(PLEASE KEEP THIS PORTION FOR YOUR RECORDS.)
(PLEASE RETURN THIS PORTION WITH YOUR PAYMENT, MAKING SURE THE RETURN ADDRESS SHOWS IN THE ENVELOPE WINDOW.)
(Account Number)
(Mail To:)
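If the output above is captured to a file, the font warnings can be separated from the text with a small filter. This is a minimal Python sketch that keeps only lines shaped like the parenthesised strings shown here. The function name is hypothetical, and arrays printed for TJ operands would need separate handling.

```python
def strings_only(lines):
    """Yield the contents of lines that look like `(text)`, skipping the rest."""
    for line in lines:
        stripped = line.strip()
        # Lines printed by `==` for a string start with "(" and end with ")".
        if len(stripped) >= 2 and stripped.startswith("(") and stripped.endswith(")"):
            yield stripped[1:-1]
```

For example, `list(strings_only(open("out.txt")))` (with `out.txt` as a placeholder for the captured output) would give the extracted strings without the surrounding parentheses.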

Isn't PostScript fun?

Licensed under: CC-BY-SA with attribution