I am using pdftotext with the options "-enc utf-8 -htmlmeta -raw" and passing the output into a Python script that parses it. (Please read on even if you're unfamiliar with pdftotext, since that may not be relevant.)

For some of the PDFs that we are processing, pdftotext is outputting metadata that looks like this:

<meta name="CreationDate" content="<FE><FF>">

In Python, I am doing this (basically):

attrib[name] = content.decode('utf-8')

where content is that <FE><FF> string in the above piece of metadata. Python raises an exception:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 0: unexpected code byte

At this point, I am unsure if the problem is the PDF itself, or the output from pdftotext, or Python's way of interpreting utf-8.

I have googled and not found anything conclusive.

Essentially, I would expect pdftotext -enc utf-8 to only output valid utf-8. And I would expect Python to understand how to deal with that utf-8 when decoding. Is there some part of this that I am missing?

I would appreciate any help in understanding why this is occurring, and help with a solution.

Thanks!


Solution

Two things:

First, instead of using content.decode('utf-8'), use:

content.decode('utf-8-sig')

This will automatically remove a UTF-8 BOM (if one is present).
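As a rough illustration of what 'utf-8-sig' does and does not handle, here is a small Python 3 sketch. The helper name strip_utf8_bom and the sample byte strings are made up for the example; only the codec name comes from the answer above.

```python
def strip_utf8_bom(raw: bytes) -> str:
    # "utf-8-sig" drops a leading EF BB BF if present, then decodes as UTF-8.
    return raw.decode("utf-8-sig")


print(repr(strip_utf8_bom(b"\xef\xbb\xbf2011")))  # UTF-8 BOM is removed
print(repr(strip_utf8_bom(b"2011")))              # no BOM, decoded unchanged

try:
    # FE FF is not a UTF-8 BOM, so this codec still rejects it.
    strip_utf8_bom(b"\xfe\xff")
except UnicodeDecodeError as exc:
    print("still fails:", exc)
```

So 'utf-8-sig' on its own would not get rid of the error for the FE FF bytes shown in the question.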

Second, it looks like pdftotext is outputting a UTF-16 BOM, not a UTF-8 one: '\xFE\xFF' is the big-endian UTF-16 BOM, whereas the UTF-8 BOM is '\xEF\xBB\xBF'. You'll need to figure out why you're getting UTF-16, or change your script to decode from UTF-16.
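If changing the script is the easier route, one possible approach is to look at the leading bytes and pick the codec from the BOM. The following is only a sketch in Python 3, assuming the metadata value is available as raw bytes; decode_meta is a hypothetical helper, not something pdftotext or the original script provides.

```python
import codecs


def decode_meta(content: bytes) -> str:
    """Decode a metadata value, honouring a leading BOM if there is one."""
    if content.startswith(codecs.BOM_UTF8):
        return content.decode("utf-8-sig")
    if content.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # The "utf-16" codec reads the BOM to pick the byte order and strips it.
        return content.decode("utf-16")
    # Assumption: anything without a BOM is plain UTF-8.
    return content.decode("utf-8")


# The value from the question is a bare UTF-16 BOM with nothing after it.
print(repr(decode_meta(b"\xfe\xff")))
```

With this, a value consisting of nothing but the BOM decodes to an empty string, which may mean the CreationDate in those PDFs is simply empty.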

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow