pyPdf: illegal UTF-16 surrogate

https://stackoverflow.com/questions/15673335

30-03-2022

Question

I have a pdf file that breaks pyPdf: http://tovotu.de/tests/test.pdf

This is the sample script:

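from pyPdf import PdfFileWriter, PdfFileReader

outputPdf = PdfFileWriter()

# Keep the input file open until the output has been written.
with open("test.pdf", "rb") as inpdf:
    inputPdf = PdfFileReader(inpdf)
    # Copy every page into the writer (plain loop, not a throwaway list).
    for page in inputPdf.pages:
        outputPdf.addPage(page)

    with open("output.pdf", "wb") as outpdf:
        outputPdf.write(outpdf)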

Error output is here: http://pastebin.com/0m38zhjQ

The error is the same when using PyPDF2 from GitHub. pdftk handles this PDF just like any other. Note that writing fails, while reading seems to work just fine!

Can you at least point me to the exact part of the pdf that causes that error? A workaround would be even nicer :)

Solution

Looks like a bug in PyPDF2. In this section:

if string.startswith(codecs.BOM_UTF16_BE):
    retval = TextStringObject(string.decode("utf-16"))
    retval.autodetect_utf16 = True

it assumes that any string starting with (0xFE, 0xFF) can be decoded as UTF-16. Your file contains a bytestring that begins that way but then contains invalid UTF-16.
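To see the failure mode in isolation, here is a minimal sketch in Python 3. The byte string is made up for illustration (it is not taken from test.pdf), and try_utf16 is a hypothetical helper, not part of pyPdf or PyPDF2.

```python
import codecs

# Made-up sample: a UTF-16 BE byte order mark, then a high surrogate (0xD800)
# followed by "A" (0x0041) where a low surrogate would be required.
bad = codecs.BOM_UTF16_BE + b"\xd8\x00\x00\x41"

# A well-formed counterpart: the same BOM followed by valid UTF-16 BE text.
good = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")


def try_utf16(raw):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return raw.decode("utf-16")
    except UnicodeDecodeError:
        return None


# Both strings pass the startswith() check, but only one of them decodes.
print(bad.startswith(codecs.BOM_UTF16_BE), try_utf16(bad))
print(good.startswith(codecs.BOM_UTF16_BE), try_utf16(good))
```

The BOM check alone cannot tell these two inputs apart, which is why the decode call needs to be guarded or avoided.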

The simplest fix is to comment out that if statement and unconditionally use the branch whose comment begins "# This is probably a big performance hit here".
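A gentler variant of the same idea would be to keep the BOM check but guard the decode, falling back when it fails. The following is only a sketch in Python 3 under that assumption: decode_text_string is a hypothetical standalone function, not PyPDF2's code, and returning the raw bytes stands in for whatever the library's fallback branch does.

```python
import codecs


def decode_text_string(raw):
    """Hypothetical sketch: return (value, kind).

    kind is "utf-16" when the bytes start with a UTF-16 BE BOM and decode
    cleanly; otherwise the raw bytes are returned unchanged with kind "bytes".
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        try:
            return raw.decode("utf-16"), "utf-16"
        except UnicodeDecodeError:
            # The BOM was misleading: fall through and keep the bytes as-is.
            pass
    return raw, "bytes"


# Made-up inputs for illustration.
valid = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
invalid = codecs.BOM_UTF16_BE + b"\xd8\x00\x00\x41"

print(decode_text_string(valid))
print(decode_text_string(invalid)[1])
```

With this shape, well-formed UTF-16 strings are still decoded as text, and only the malformed ones take the fallback path.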

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow