Question

I am using the open-source pdftotext tool to convert PDF files to text. How can I save the text files as UTF-8 so that all accented characters are retained? I am using the command below, which extracts the content to a text file, but the accented characters do not appear in the output.

pdftotext -enc UTF-8 book1.pdf book1.txt

Please help me to resolve this issue.

Thanks in advance,


Solution

You can get a list of available encodings using the command:

pdftotext -listenc

and pick the right one using the -enc argument. Mine here seems to produce UTF-8 by default, i.e. your "-enc UTF-8" is superfluous (though harmless):

pdftotext -enc UTF-8 your.pdf

You may want to check your locale (LC_ALL, LANG, ...).
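A quick way to do that check is sketched below. It only inspects the usual environment variables and reports whether the effective character locale names UTF-8. The helper name is_utf8_locale is made up for this example and is not part of pdftotext; whether your pdftotext build actually consults the locale is an assumption you would have to verify yourself.

```shell
#!/bin/sh
# is_utf8_locale: hypothetical helper (not part of pdftotext).
# Succeeds when a locale name such as "de_DE.UTF-8" declares UTF-8.
is_utf8_locale() {
    case "$1" in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

# For character handling, LC_ALL overrides LC_CTYPE, which overrides LANG.
effective="${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"

if is_utf8_locale "$effective"; then
    echo "character locale '$effective' is UTF-8"
else
    echo "character locale '${effective:-unset}' is not UTF-8"
fi
```

Running the plain locale command prints the same settings in full if you prefer to read them directly.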

EDIT: I downloaded the following PDF: http://www.i18nguy.com/unicode/unicodeexample.pdf

and converted it on a Windows 7 PC (German) with XPDF 3.02PL5 using the command:

pdftotext.exe -enc UTF-8 unicodeexample.pdf

The text file is definitely UTF-8 encoded, as all characters are displayed correctly. What are you using the text file for? If you're displaying it through a web application, your content encoding might simply be wrong, while the text file has been converted as you wanted it to.

Double-check using either a browser (switch the encoding in Firefox between ISO-8859-1 and UTF-8 and see which one renders the accents correctly) or a hex editor.
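If you have no hex editor at hand, the same check can be sketched on the command line with od and iconv. The helper names hex_bytes and is_valid_utf8 are invented for this example, and the sample bytes are typed in by hand rather than taken from your file. The point is that an accented letter such as "é" is two bytes (c3 a9) in UTF-8 but a single byte (e9) in ISO-8859-1.

```shell
#!/bin/sh
# hex_bytes: hypothetical helper; prints stdin as space-separated hex bytes.
# -v stops od from collapsing repeated lines into "*".
hex_bytes() {
    echo $(od -An -v -tx1)
}

# is_valid_utf8: hypothetical helper; succeeds if the file decodes as UTF-8.
# iconv exits non-zero on an illegal input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# "café" encoded as UTF-8: the é is the two bytes c3 a9.
printf 'caf\303\251' | hex_bytes   # prints: 63 61 66 c3 a9

# "café" encoded as ISO-8859-1: the é is the single byte e9.
printf 'caf\351' | hex_bytes       # prints: 63 61 66 e9
```

To apply it to the converted file, something like `head -c 200 book1.txt | hex_bytes` and `is_valid_utf8 book1.txt && echo ok` should do, assuming book1.txt is the output name from the question.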

OTHER TIPS

Things are getting a little bit messy, so I'm adding another answer.

I took the PDF apart and my best guess would be a "problem" with the font used:

  • open the PDF file in Acrobat Reader
  • select all the text on the page
  • copy it and paste it into a Unicode-aware text editor (there's no "hidden" OCR, so you're copying actual data)

You'll see that the codepoints you end up with aren't the ones you're seeing in the PDF reader. Whatever the font is, it may have a mapping different from the one defined in the Unicode standard. As such, your content is "wrong" and there's not much you can do about it.
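To see which codepoints you actually ended up with, you can dump them rather than trust how they render. The sketch below assumes the pasted or extracted text is UTF-8 and that iconv, od and awk are available; codepoints is a made-up helper name. The second sample uses a Private Use Area codepoint purely as an illustration of what a non-standard font mapping might produce, it is not taken from the PDF in question.

```shell
#!/bin/sh
# codepoints: hypothetical helper; reads UTF-8 on stdin and prints each
# character as U+XXXX. It converts to UTF-32BE (four bytes per character)
# and glues every four hex bytes from od back into one number.
codepoints() {
    iconv -f UTF-8 -t UTF-32BE | od -An -v -tx1 | awk '
        {
            for (i = 1; i <= NF; i++) {
                word = word $i
                if (++count % 4 == 0) {
                    # Drop leading zeros but keep at least four hex digits.
                    while (length(word) > 4 && substr(word, 1, 1) == "0")
                        word = substr(word, 2)
                    printf "%sU+%s", sep, toupper(word)
                    sep = " "
                    word = ""
                }
            }
        }
        END { if (count) print "" }'
}

# Correctly mapped text: "café".
printf 'caf\303\251' | codepoints   # prints: U+0063 U+0061 U+0066 U+00E9

# Illustrative only: a Private Use Area character, which has no standard meaning.
printf '\356\200\201' | codepoints  # prints: U+E001
```

If the output for your file shows codepoints that do not correspond to the letters you see in the PDF viewer, the extraction worked and the data in the PDF itself is the problem.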

Licensed under: CC-BY-SA with attribution