The document presented by the OP contains, e.g., the line of author names (Attila Góbi, Zalán Szűgyi and Tamás Kozsik), which quite likely is a sample of the issue the OP has identified.
Looking at the page content stream, though,
[(A)32(ttila)-384(G\023)575(obi,)-383
(Zal)8(\023)567(an)-383(Sz)-32(})607(ugyi)-384(and)-383
(T)96(am)8(\023)567(as)-384(Kozsik)]TJ
one sees that, e.g., in (G\023)575(obi,) the ó is created by first drawing the ´ (\023), then going back the width of that glyph (575), and then drawing the o.
Thus, you do have these two glyphs ´ and o printed in the same location, not a single glyph ó.
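To make the "going back" step concrete, here is a small sketch of the arithmetic. In a TJ array a number is given in thousandths of a text space unit and is subtracted from the current horizontal position, so an adjustment equal to the glyph width cancels that glyph's advance. The class and method names are made up for illustration, the font size of 10 is an assumption (it is not stated above), and character spacing, word spacing and horizontal scaling are ignored.

```java
class TjAdvance {
    // Net horizontal displacement (in text space units) caused by showing one glyph
    // and then applying the TJ adjustment that follows it.
    // Simplification: character spacing, word spacing and horizontal scaling are ignored.
    static double netAdvance(double glyphWidth, double tjAdjustment, double fontSize) {
        return (glyphWidth - tjAdjustment) / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        // The acute accent (\023) is 575 units wide and is followed by the adjustment 575,
        // so the following o starts where the accent started.
        System.out.println(netAdvance(575, 575, 10.0)); // prints 0.0
    }
}
```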
The PDFBox PDFTextStripper currently does not combine characters printed at the same location; the only thing it does in that regard is drop an identical glyph drawn twice at about the same location.
Thus, aside from replaceAll("o'","ó") as mentioned by the OP, one can also extend the PDFTextStripper to combine certain glyphs, either early in its method processTextPosition or late in writeString(String text, List<TextPosition> textPositions).
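As a rough sketch of the "late" variant, the combining step itself needs nothing but the JDK. The helper below is hypothetical (AccentCombiner and combineAcute are names made up here) and it assumes the accent arrives in the extracted text as a spacing acute accent (U+00B4) directly before its base letter, which is the drawing order seen in the content stream above. Check what your extraction actually contains before relying on that. It moves the accent behind the letter as a combining acute (U+0301) and lets java.text.Normalizer compose the pair. I have not run it against the OP's document.

```java
import java.text.Normalizer;

class AccentCombiner {
    // Hypothetical helper: replaces "spacing acute + letter" by the precomposed letter.
    // Assumption: the accent is extracted as U+00B4 immediately before the base letter.
    static String combineAcute(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean accentBeforeLetter = c == '\u00B4'
                    && i + 1 < text.length()
                    && Character.isLetter(text.charAt(i + 1));
            if (accentBeforeLetter) {
                // Emit the base letter first, then the combining acute accent.
                sb.append(text.charAt(i + 1)).append('\u0301');
                i++; // the base letter has been consumed
            } else {
                sb.append(c);
            }
        }
        // NFC composes base letter + combining mark where a precomposed form exists.
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(combineAcute("G\u00B4obi, Zal\u00B4an"));
    }
}
```

In a PDFTextStripper subclass one could call such a helper on the text argument in an overridden writeString before passing the result on. The double acute in Szűgyi would need an analogous rule with U+030B.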