I am using pdftotext with the options "-enc utf-8 -htmlmeta -raw" and passing the output to a Python script that parses it. (Please read on even if you're unfamiliar with pdftotext, since that may not be relevant.)
For some of the PDFs that we are processing, pdftotext outputs metadata that looks like this:
<meta name="CreationDate" content="<FE><FF>">
In Python, I am doing this (basically):
attrib[name] = content.decode('utf-8')
where content is the <FE><FF> string from the metadata above. Python raises an exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 0: unexpected code byte
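The failure can be reproduced in isolation. This is only a minimal sketch: it assumes that "<FE><FF>" is how the viewer renders the two raw bytes 0xFE 0xFF, and it is written for Python 3 (the call in my script is the Python 2 equivalent), so the exact wording of the error message may differ. It also checks one thing that may be relevant: those two bytes are the UTF-16 big-endian byte order mark.

```python
import codecs

# Assumption: "<FE><FF>" in the pdftotext output stands for the raw bytes 0xFE 0xFF.
content = b"\xfe\xff"

try:
    content.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # 0xFE can never appear in well-formed UTF-8, so decoding fails at byte 0.
    error = exc

print(error)

# The same two bytes are the UTF-16 big-endian byte order mark.
is_utf16_be_bom = content == codecs.BOM_UTF16_BE
print(is_utf16_be_bom)
```

If that assumption holds, the value is not valid UTF-8 at all, so Python's decoder is behaving as specified.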
At this point, I am unsure if the problem is the PDF itself, or the output from pdftotext, or Python's way of interpreting utf-8.
I have googled and not found anything conclusive.
Essentially, I would expect pdftotext -enc utf-8 to only output valid UTF-8, and I would expect Python to be able to decode that UTF-8. Is there some part of this that I am missing?
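In the meantime, one defensive option is to decode each metadata value without letting a single bad value abort the parse. The sketch below is an assumption-laden workaround, not a confirmed fix: `decode_meta_value` is a hypothetical helper name, and it guesses that a value starting with 0xFE 0xFF is a BOM-prefixed UTF-16BE string that was passed through unconverted. Anything else is decoded as UTF-8, with undecodable bytes replaced by U+FFFD.

```python
import codecs


def decode_meta_value(raw: bytes) -> str:
    """Hypothetical helper: decode one metadata value without raising."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Assumed case: BOM-prefixed UTF-16BE. Skip the 2-byte BOM and decode the rest.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be", errors="replace")
    # Normal case: the value is expected to be UTF-8; replace any invalid bytes.
    return raw.decode("utf-8", errors="replace")


# A value consisting of only the BOM decodes to an empty string.
creation_date = decode_meta_value(b"\xfe\xff")
print(repr(creation_date))
```

With this, a CreationDate of just 0xFE 0xFF becomes an empty string instead of raising, which would match a PDF whose CreationDate field is present but empty.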
I would appreciate any help in understanding why this is occurring, and help with a solution.
Thanks!