Correctly extracting the text from a PDF (UTF-8)

https://stackoverflow.com//questions/10649511

pdf
text
utf-8
text-extraction
pdf-extraction

11-12-2019

Question

I want to extract text from some PDF files (programmatically, with some utility, or even with copy/paste), but some characters come out wrong. Although I specify UTF-8 encoding when extracting the text, characters such as "ș", "ț", and "ă" come out as "„" or "˛" instead of the displayed character (or at least the plain "s", "t", "a"). The text is displayed correctly in the viewer, but when I copy it, for example, those characters are not OK.
Is there some way to extract the text correctly, or are those PDF files corrupted in some way? Any language (Java, C, Python, etc.) or utility (Windows, Linux, etc.) would do.

Solution

Can you extract the text correctly in Acrobat from the PDF?
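
To compare what different tools (Acrobat, copy/paste, a command-line extractor) produce from the same file, it can help to list the characters in the extracted text that should not be there. The following is only a minimal diagnostic sketch, not a fix: it assumes the text has already been extracted to a Python string by whatever tool you use, and that the document is Romanian, so the only non-ASCII letters expected are the ones in `EXPECTED_NON_ASCII`. The names `EXPECTED_NON_ASCII` and `unexpected_chars` are made up for this illustration.

```python
import unicodedata
from collections import Counter

# Assumption: Romanian text, so these are the only non-ASCII letters we expect.
# Both the comma-below (ș, ț) and the older cedilla (ş, ţ) forms are accepted.
EXPECTED_NON_ASCII = set("ăâîșțĂÂÎȘȚşţŞŢ")


def unexpected_chars(text, expected=EXPECTED_NON_ASCII):
    """Return {char: (count, unicode_name)} for every non-ASCII,
    non-whitespace character in `text` that is not in `expected`."""
    # Normalize so that "s" + combining comma below is compared as "ș".
    text = unicodedata.normalize("NFC", text)
    counts = Counter(
        ch for ch in text
        if ord(ch) > 127 and not ch.isspace() and ch not in expected
    )
    return {ch: (n, unicodedata.name(ch, "UNKNOWN")) for ch, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample: what a bad extraction of "mașină" might look like.
    sample = "ma\u201ein\u02db"
    for ch, (n, name) in unexpected_chars(sample).items():
        print(f"U+{ord(ch):04X} {name}: {n}")
```

If the same unexpected characters show up no matter which tool does the extraction, the problem is more likely in the PDF itself than in the tool or in the output encoding you selected.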

Licensed under: CC-BY-SA with attribution
