Inaccurate PDF-to-text conversion

https://stackoverflow.com/questions/17786192

03-06-2022

Question

I have tried almost every PDF-to-text converter available on Linux, but parts of the extracted text are corrupted or inaccurate: some characters are replaced with others, some words that are present in the PDF are missing from the text, and some words contain stray semicolons and similar characters.

I also tried aspell to correct the words, but aspell stays silent on some of them.

NOTE: The PDF contains Swedish text.

So, is there any way to fix this inaccuracy in PDF-to-text conversion?

Solution

No. I think there is no solution that works for all PDF files, because the actual text underlying the displayed glyphs can be stored in various ways.

When PDFs are generated by LaTeX, for example, how non-ASCII characters are embedded depends on several configuration options. Sometimes I got :o instead of ö, sometimes o:, and sometimes the character was embedded directly. Each of these variants was displayed as ö, though.
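If the stray mark in the extracted text is really a diaeresis emitted as a separate character, part of the damage can sometimes be repaired after extraction. The following is only a minimal sketch under that assumption: it expects the converter to output either a standalone diaeresis (U+00A8) before or after the base letter, or a combining diaeresis (U+0308) after it. The function name repair_umlauts is made up for illustration. A literal colon, as in ":o", is deliberately not handled, because it cannot be told apart from ordinary punctuation without a word list.

```python
import re
import unicodedata

SPACING_DIAERESIS = "\u00a8"    # standalone diaeresis character
COMBINING_DIAERESIS = "\u0308"  # combining mark, attaches to the preceding letter


def repair_umlauts(text: str) -> str:
    """Merge a separately emitted diaeresis with its base vowel (hypothetical helper)."""
    # Assumed case 1: mark emitted before the base letter, e.g. "\u00a8o".
    text = re.sub(
        SPACING_DIAERESIS + r"([aouAOU])",
        r"\1" + COMBINING_DIAERESIS,
        text,
    )
    # Assumed case 2: mark emitted after the base letter, e.g. "o\u00a8".
    text = re.sub(
        r"([aouAOU])" + SPACING_DIAERESIS,
        r"\1" + COMBINING_DIAERESIS,
        text,
    )
    # NFC composes "o" + combining diaeresis into the single code point U+00F6.
    return unicodedata.normalize("NFC", text)
```

Text that already contains the precomposed character passes through unchanged, so the function should be safe to run over output from different converters. It does nothing for words that are missing entirely.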

If you copy and paste the text from your favorite PDF viewer, or search for the corrupted word, you will probably see the same effects.

To work around these issues, one can use OCR software, with all the recognition errors such tools bring.
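As a rough illustration of the OCR route, the sketch below renders each page to an image and runs Tesseract on it. It rests on several assumptions not stated above: that the pdf2image package (which needs Poppler) and the pytesseract package (which needs the Tesseract binary) are installed, and that Tesseract's Swedish language data ("swe") is available. The function name ocr_pdf is hypothetical.

```python
def ocr_pdf(pdf_path: str, lang: str = "swe", dpi: int = 300) -> str:
    """Render every page of a PDF to an image and OCR it (illustrative sketch)."""
    # Imported here so the dependency on the assumed third-party packages is explicit.
    from pdf2image import convert_from_path
    import pytesseract

    # One PIL image per page; a higher dpi usually helps recognition but is slower.
    pages = convert_from_path(pdf_path, dpi=dpi)

    # OCR ignores the embedded text layer and reads the rendered glyphs instead.
    texts = [pytesseract.image_to_string(page, lang=lang) for page in pages]
    return "\n".join(texts)
```

Because OCR works from the rendered page, it sidesteps the encoding of the embedded text, but it can misread characters in its own way, so the result still needs checking.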

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow