PDF to text messes up Latin accents [duplicate]

https://stackoverflow.com/questions/20413478

pdf
latin1

29-08-2022

Question

I have a few PDFs written in Brazilian Portuguese which I'd like to parse and process. I tried using the PDFBox text extraction command-line tools (with no arguments at all), but I get the following results:

Cão

ends up as

C~
ao

Copying and pasting the text, or exporting it as text from Adobe Reader, gives the same result. Doing the same (PDFBox, copy and paste, Adobe Reader export) with other files, I managed to extract the text as expected ("Cão"), so, not being a PDF expert, I figure it has to do with the way the files were created. I'd like to know if anyone has seen this behavior and how to work around it when extracting the text.

Solution

Thanks to Stack Overflow, I managed to find the post below:

How to get text extraction from PDF to work?

which gave me the information I was looking for. Apparently the PDFs are being generated without the information needed to understand the Latin characters.
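If regenerating the PDFs is not an option, one possible workaround is to post-process the extracted text. The sketch below is not something PDFBox does for you, and it is only a heuristic: it assumes the damage always looks like the "C~ ao" example above, i.e. a spacing accent mark followed by optional whitespace and then the letter it belongs to. The function name `repair_accents` and the mapping table are my own, made up for illustration. Note that it can also rewrite legitimate text such as `~user`, so check the output on your own files before relying on it.

```python
import re
import unicodedata

# Assumption: accents show up as standalone spacing marks placed just
# before the letter they belong to. Each one is mapped to the Unicode
# combining mark with the same shape.
SPACING_TO_COMBINING = {
    "~": "\u0303",       # tilde
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex
    "\u00a8": "\u0308",  # diaeresis
    "\u00b8": "\u0327",  # cedilla
}

# A spacing mark, optional whitespace (including a line break), one ASCII letter.
_PATTERN = re.compile(r"([~\u00b4`^\u00a8\u00b8])\s*([A-Za-z])")


def repair_accents(text):
    """Re-attach stray spacing accents to the letter that follows them."""

    def _join(match):
        mark = SPACING_TO_COMBINING[match.group(1)]
        composed = unicodedata.normalize("NFC", match.group(2) + mark)
        # Only accept the result if Unicode has a single precomposed
        # character for this pair; otherwise leave the text untouched.
        return composed if len(composed) == 1 else match.group(0)

    return _PATTERN.sub(_join, text)


if __name__ == "__main__":
    print(repair_accents("C~\nao"))
```

You would run this over the text file written by the extraction tool. It only handles the case where the mark comes before the letter; if your files put the mark after the letter, the pattern would need to be reversed.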

License: CC-BY-SA with attribution

Not affiliated with Stack Overflow