I am using PDFBox to read and replace PDF text in the standard documented way, i.e. through COSString (the Tj and TJ operators). It seemed to work fine until it was tested against the following PDF file:

http://www.ocs.fas.harvard.edu/students/materials/resumes_and_cover_letters.pdf

It works fine up to page 7, but after that the data read comes out in a strange form. Below are a few lines of the output:

S˛˚ R˚˘˚RESUMES AND COVER LETTERSPeter J. Lee      : L Q W K U R S  0 D L O  & H Q W H U  ±  & D P E U L G J H   0 D V V D F K X V H W W V                     ±  S M O H H # I D V  K D U Y D U G  H G X  

What can be the reason for that?

Thanks, Usman


Solution

"read/replace PDF text using standard documented way i.e. through COSString (Tj and TJ operators)"

This "documented way" unfortunately is very misleading for two reasons:

  1. It assumes that the string parameters of Tj and TJ are encoded in some standard encoding. Actually the encoding is governed by the font and may be a completely custom-made one. Depending on the font type, the encoding may even be a multibyte encoding.

  2. It assumes letters and whole words come in the same order, unbroken, as you read them. This also need not be the case.
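
To illustrate the second point, here is a minimal sketch in plain Java (no PDFBox involved). The class `FragmentedTextSketch` and the fragment list are made up for illustration; the fragments stand in for the string operands of a single TJ array in which one word has been split up for kerning. A replace that looks at each string on its own never sees the whole word:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of PDFBox.
class FragmentedTextSketch {

    // Applies the replacement to each string operand separately,
    // the way a naive per-COSString replace would.
    static List<String> replacePerFragment(List<String> fragments,
                                           String target, String replacement) {
        return fragments.stream()
                .map(f -> f.replace(target, replacement))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed content of one TJ array: the word "Resumes" split into three strings.
        List<String> fragments = Arrays.asList("Re", "sum", "es");

        List<String> result = replacePerFragment(fragments, "Resumes", "CVs");

        // No single fragment contains the word, so nothing is replaced.
        System.out.println(result.equals(fragments));

        // Only the concatenation contains it, and even that relies on the
        // fragments being in reading order, which a PDF does not guarantee.
        System.out.println(String.join("", fragments).contains("Resumes"));
    }
}
```

Even in this friendly case the replacement would have to be spread back over several operands, and kerning offsets between them would need adjusting.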

PDF simply is not a format designed for editing content. In simply constructed documents it can be done fairly easily; in general, though, it is really difficult.

PS: The strange output from your sample document is due to the use of a composite font using Identity-H encoding which embeds a subset of TimesNewRoman.

That font does contain a ToUnicode mapping; thus, translating what you read to character data is possible.
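
As a rough sketch of what that translation involves (plain Java, not PDFBox code): with Identity-H every character code is two bytes long, and each code has to be looked up in a code-to-Unicode table. The class `ToUnicodeSketch` is hypothetical, and the table below is an assumption built purely for illustration: it maps each code to the character 29 positions higher, which is what your sample output suggests (':' + 29 = 'W', 'L' + 29 = 'i'). A real table would have to be parsed from the font's ToUnicode CMap.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of PDFBox.
class ToUnicodeSketch {

    // Decodes a string operand made of two-byte character codes by looking
    // each code up in the given code-to-Unicode table.
    static String decode(byte[] raw, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            // Unknown codes become the Unicode replacement character.
            text.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    // Reads the same bytes one byte per character, ignoring the font.
    static String naive(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Illustrative table only: assumes code + 29 is the Unicode value.
    static Map<Integer, String> assumedTable() {
        Map<Integer, String> table = new HashMap<>();
        for (int code = 3; code <= 97; code++) {
            table.put(code, String.valueOf((char) (code + 29)));
        }
        return table;
    }

    public static void main(String[] args) {
        // Two-byte codes whose low bytes are the characters ": L Q W K U R S".
        byte[] raw = {0x00, 0x3A, 0x00, 0x4C, 0x00, 0x51, 0x00, 0x57,
                      0x00, 0x4B, 0x00, 0x55, 0x00, 0x52, 0x00, 0x53};

        // Naive reading: NUL bytes between the letters, shown here as spaces.
        System.out.println(naive(raw).replace('\0', ' '));

        // Table lookup: prints "Winthrop".
        System.out.println(decode(raw, assumedTable()));
    }
}
```

The naive reading reproduces the ": L Q W K U R S" pattern from your output; the zero high bytes of the two-byte codes are where the apparent spaces between the letters come from.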

Replacing that text could be a problem, though, because only a subset is embedded; e.g. the capital letters 'I' and 'J' are not embedded and cannot be used in a replacement unless you either use a different font or possibly even add to the partial font. Neither of these operations is as simple as your original code.
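
Before attempting a replacement, one would therefore at least want to check whether every character of the new text is available in the embedded subset. A minimal sketch of such a check in plain Java follows; the class `SubsetCheckSketch` is hypothetical and the set of embedded characters is an assumption for illustration (only the absence of 'I' and 'J' is taken from the sample document). In practice the set would have to be derived from the font itself.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class, not part of PDFBox.
class SubsetCheckSketch {

    // Collects the characters of a string into a set.
    static Set<Character> charsOf(String s) {
        Set<Character> chars = new HashSet<>();
        for (char c : s.toCharArray()) {
            chars.add(c);
        }
        return chars;
    }

    // Returns the characters of the replacement text that the embedded
    // subset cannot draw, in order of first appearance.
    static Set<Character> missingGlyphs(String replacement, Set<Character> embedded) {
        Set<Character> missing = new LinkedHashSet<>();
        for (char c : replacement.toCharArray()) {
            if (!embedded.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed subset: all Latin letters and the space, except 'I' and 'J'.
        Set<Character> embedded = charsOf(
                "ABCDEFGHKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ");

        // Prints [J, I]: this text cannot be drawn with the subset font.
        System.out.println(missingGlyphs("John Irving", embedded));
    }
}
```

If the result is not empty, the replacement needs a different font or an extended subset.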

And this is not the worst imaginable scenario: sometimes there is no information at all on how to interpret the raw data in the string as text; the PDF only knows how to draw the glyphs.

Licensed under: CC-BY-SA with attribution