Your code is ignoring almost all operations which change the line. You do consider ' and " which most often imply a change of line but which in the wild are seldom used.
Inside a text object (BT .. ET) you, therefore, should also look out for
- tx ty Td Move to the start of the next line, offset from the start of the current line by (tx, ty).
- tx ty TD Move to the start of the next line, offset from the start of the current line by (tx, ty). As a side effect, this operator shall set the leading parameter in the text state (to -ty).
- a b c d e f Tm Set the text matrix, Tm, and the text line matrix, Tlm.
- T* Move to the start of the next line.
To interpret ', " and T* correctly, you should also look out for
- leading TL Set the text leading, Tl, to leading.
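To make the interplay of these operators concrete, here is a minimal sketch in Python (not the library from the question). It assumes the content stream of a single text object has already been parsed into hypothetical (operands, operator) pairs; how your parser delivers them is not modeled here. The function name line_starts and the tuple layout are made up for illustration.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)


def multiply(m, n):
    """Return m x n for PDF-style affine matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def line_starts(operations):
    """Collect the origin (x, y) of the text line matrix after each operator
    that repositions it. `operations` is an assumed, pre-parsed list of
    (operands, operator) pairs taken from inside one BT .. ET."""
    tlm = IDENTITY   # text line matrix, reset at BT
    leading = 0      # text leading, Tl
    starts = []
    for operands, op in operations:
        if op == "TL":
            leading = operands[0]
            continue
        if op in ("Td", "TD"):
            tx, ty = operands
            if op == "TD":
                leading = -ty  # side effect of TD
            tlm = multiply((1, 0, 0, 1, tx, ty), tlm)
        elif op == "Tm":
            tlm = tuple(operands)
        elif op in ("T*", "'", '"'):
            # all three move down by the current leading
            tlm = multiply((1, 0, 0, 1, 0, -leading), tlm)
        else:
            continue  # Tj, TJ, Tf, ... do not change the line start
        starts.append((tlm[4], tlm[5]))
    return starts
```

Two consecutive entries with the same y value (in an unrotated text matrix) mean the text stays on the same line; only a changed y indicates a new one. The coordinates are in text space, so they still have to be combined with the current transformation matrix before positions from different text objects are compared.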
If you find multiple text objects (BT .. ET .. BT .. ET), the second one is not necessarily on a new line. You should look out for the special graphics state operators between them:
- a b c d e f cm Modify the current transformation matrix (CTM) by concatenating the specified matrix
- q Save the current graphics state
- Q Restore the graphics state
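As a rough illustration of why these three operators matter, the following Python sketch records the CTM in effect at each BT. It again assumes a hypothetical list of pre-parsed (operands, operator) pairs; the function name ctm_at_text_objects is invented, and a real graphics state holds far more than the CTM.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)


def multiply(m, n):
    """Return m x n for PDF-style affine matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def ctm_at_text_objects(operations):
    """Return the CTM that is current at every BT operator.
    Only the CTM part of the graphics state is tracked in this sketch."""
    ctm = IDENTITY
    stack = []
    result = []
    for operands, op in operations:
        if op == "q":
            stack.append(ctm)            # save
        elif op == "Q":
            if stack:                    # tolerate unbalanced Q
                ctm = stack.pop()        # restore
        elif op == "cm":
            ctm = multiply(tuple(operands), ctm)  # new CTM = matrix x old CTM
        elif op == "BT":
            result.append(ctm)
    return result
```

If two text objects start with identical text matrices but the recorded CTMs differ in their vertical translation, the second object ends up on a different line even though no text positioning operator says so.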
Your code is ignoring all numeric arguments to the operations. You should not ignore them, especially:
- You should check the parameters of the operators listed above; e.g. while
      0 -20 Td
  starts a new line 20 units down,
      20 0 Td
  remains on the same line and merely starts drawing text 20 units right of the former line start.
- You should check the numeric elements of the array parameter of TJ as they may (or may not!) indicate space between two words.
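For the TJ case, a crude heuristic might look like the sketch below. It assumes the array elements arrive as already decoded Python strings and numbers, and the threshold of 200 (thousandths of a text space unit) is an arbitrary assumption; a more careful implementation would derive it from the width of the space glyph in the current font.

```python
def join_tj(elements, gap_threshold=200):
    """Join the elements of a TJ array into one string, inserting a space
    where a numeric adjustment opens a sufficiently large gap.
    `elements` mixes str (assumed already decoded) and int/float values."""
    parts = []
    for element in elements:
        if isinstance(element, (int, float)):
            # The number is subtracted from the current position, so a
            # negative value moves the next glyph further to the right.
            if -element >= gap_threshold and parts and not parts[-1].endswith(" "):
                parts.append(" ")
        else:
            parts.append(element)
    return "".join(parts)
```

With this threshold, ["Hel", -15, "lo", -300, "World"] yields "Hello World", while the small kerning value in ["A", 120, "V"] yields "AV" without a space.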
Your code is assuming the Value of CString instances to already contain Unicode encoded character data. This assumption is in general incorrect: the encoding used in PDF strings drawn by text drawing operations is determined by the font. Thus, you should also look out for
- font size Tf Set the text font, Tf, to font and the text font size, Tfs, to size. font shall be the name of a font resource in the Font subdictionary of the current resource dictionary.
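The following Python sketch shows the idea of decoding per font. Everything about its inputs is an assumption: font_maps is a hypothetical dictionary you would have to build yourself from each font's ToUnicode CMap or Encoding entry, string operands are taken to be raw bytes, and only single-byte character codes (simple fonts) and the Tj operator are handled.

```python
def decode_shown_text(operations, font_maps):
    """Decode the string operands of Tj using the font selected by the last Tf.
    `operations`: assumed list of (operands, operator) pairs.
    `font_maps`: hypothetical {font resource name: {character code: str}}."""
    current_font = None
    decoded = []
    for operands, op in operations:
        if op == "Tf":
            current_font = operands[0]  # operands are (font name, size)
        elif op == "Tj":
            mapping = font_maps.get(current_font, {})
            raw = operands[0]           # bytes; iterating yields int codes
            # U+FFFD marks codes the mapping does not cover
            decoded.append("".join(mapping.get(code, "\ufffd") for code in raw))
    return decoded
```

The same byte can stand for different characters depending on the font selected at the time, which is why the raw string value cannot be treated as Unicode text on its own.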
For details you should first and foremost study the PDF specification ISO 32000-1, especially chapter 9 "Text", with a solid background from chapter 8 "Graphics".