Question

I wrote the following function to read the text out of a PDF file. It is pretty close, but I'm not familiar enough with all the op codes to get the line spacing right. For example, I currently insert a new line when I see "ET", but that doesn't seem quite right, since it may just be the end of a text run in the middle of a line. Could someone help me adjust the parsing? My goal is something similar to Adobe Reader's "Save as other" --> "Text".

Public Function ReadPDFFile(filePath As String,
                            Optional maxLength As Integer = 0) As String

    Dim sbContents As New StringBuilder

    Dim cArrayType As Type = GetType(CArray)
    Dim cCommentType As Type = GetType(CComment)
    Dim cIntegerType As Type = GetType(CInteger)
    Dim cNameType As Type = GetType(CName)
    Dim cNumberType As Type = GetType(CNumber)
    Dim cOperatorType As Type = GetType(COperator)
    Dim cRealType As Type = GetType(CReal)
    Dim cSequenceType As Type = GetType(CSequence)
    Dim cStringType As Type = GetType(CString)
    Dim opCodeNameType As Type = GetType(OpCodeName)

    Dim ReadObject As Action(Of CObject) = Sub(obj As CObject)

                                               Dim objType As Type = obj.GetType

                                               Select Case objType
                                                   Case cArrayType
                                                       Dim arrObj As CArray = DirectCast(obj, CArray)
                                                       For Each member As CObject In arrObj
                                                           ReadObject(member)
                                                       Next
                                                   Case cOperatorType
                                                       Dim opObj As COperator = DirectCast(obj, COperator)
                                                       Select Case System.Enum.GetName(opCodeNameType, opObj.OpCode.OpCodeName)
                                                           Case "ET", "Tx"
                                                               sbContents.Append(vbNewLine)
                                                           Case "Tj", "TJ"
                                                               For Each operand As CObject In opObj.Operands
                                                                   ReadObject(operand)
                                                               Next
                                                           Case "QuoteSingle", "QuoteDbl"
                                                               sbContents.Append(vbNewLine)
                                                               For Each operand As CObject In opObj.Operands
                                                                   ReadObject(operand)
                                                               Next
                                                           Case Else
                                                               'Do Nothing
                                                       End Select
                                                   Case cSequenceType
                                                       Dim seqObj As CSequence = DirectCast(obj, CSequence)
                                                       For Each member As CObject In seqObj
                                                           ReadObject(member)
                                                       Next
                                                   Case cStringType
                                                       sbContents.Append(DirectCast(obj, CString).Value)
                                                   Case cCommentType, cIntegerType, cNameType, cNumberType, cRealType
                                                       'Do Nothing
                                                   Case Else
                                                       Throw New NotImplementedException(obj.GetType().AssemblyQualifiedName)
                                               End Select

                                           End Sub

    Using pd As PdfDocument = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly)

        For Each page As PdfPage In pd.Pages

            ReadObject(ContentReader.ReadContent(page))

            If maxLength > 0 AndAlso sbContents.Length >= maxLength Then
                If sbContents.Length > maxLength Then
                    ' Truncate to exactly maxLength characters.
                    sbContents.Remove(maxLength, sbContents.Length - maxLength)
                End If
                Exit For
            End If

            sbContents.Append(vbNewLine)

        Next

    End Using

    Return sbContents.ToString

End Function

Solution

Your code ignores almost all of the operators that change the line. You do handle ' and ", which most often imply a change of line, but these are seldom used in the wild.

Inside a text object (BT .. ET) you, therefore, should also look out for

  • tx ty Td: Move to the start of the next line, offset from the start of the current line by (tx, ty).
  • tx ty TD: Move to the start of the next line, offset from the start of the current line by (tx, ty). As a side effect, this operator shall set the leading parameter in the text state.
  • a b c d e f Tm: Set the text matrix, Tm, and the text line matrix, Tlm.
  • T*: Move to the start of the next line.
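
As a rough illustration of how these operators could drive line breaks, the sketch below tracks only the vertical position of the current text line and reports a break when text is shown at a different height than the previously shown text. The class TextLineTracker and its members are hypothetical and not part of the library used in the question. It assumes unrotated, unscaled text and ignores the current transformation matrix, so it is a starting point rather than a complete solution.

```vb
Imports System

' Hypothetical helper: follows the vertical position of the current text line.
' Assumption: text is neither rotated nor scaled, and the CTM is ignored.
Public Class TextLineTracker
    Private Const Tolerance As Double = 0.5      ' assumed threshold in text space units
    Private lineY As Double = 0.0                ' y of the start of the current line
    Private lastShownY As Double? = Nothing      ' y at which text was last shown

    ' BT: the text matrix and the text line matrix are reset to the identity matrix.
    Public Sub BeginText()
        lineY = 0.0
    End Sub

    ' tx ty Td / tx ty TD: offset relative to the start of the current line.
    Public Sub MoveRelative(tx As Double, ty As Double)
        lineY += ty
    End Sub

    ' a b c d e f Tm: f becomes the new vertical position.
    Public Sub SetMatrix(e As Double, f As Double)
        lineY = f
    End Sub

    ' T*: same effect as "0 -leading Td".
    Public Sub NextLine(leading As Double)
        lineY -= leading
    End Sub

    ' Call before appending text from Tj or TJ. Returns True when the text is
    ' shown at a different height than the previously shown text.
    Public Function NeedsLineBreak() As Boolean
        Dim moved As Boolean = lastShownY.HasValue AndAlso
                               Math.Abs(lineY - lastShownY.Value) > Tolerance
        lastShownY = lineY
        Return moved
    End Function
End Class
```

In the question's ReadObject, the "Td" case would pass its two numeric operands to MoveRelative, and the "Tj" and "TJ" cases would append vbNewLine only when NeedsLineBreak returns True. With this approach 0 -20 Td leads to a break before the next string, while 20 0 Td does not.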

To interpret ', " and T* correctly, you should also look out for

  • leading TL: Set the text leading, Tl, to leading.
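
The leading-related operators reduce to a small piece of state. The sketch below uses hypothetical names and relies on the equivalences given in ISO 32000-1: T* acts like 0 -Tl Td, tx ty TD acts like -ty TL followed by tx ty Td, and ' acts like T* followed by Tj.

```vb
' Hypothetical helper: keeps the text leading (Tl) of the text state.
Public Class LeadingState
    Private currentLeading As Double = 0.0   ' the initial value of Tl is 0

    ' leading TL: set the text leading.
    Public Sub SetLeading(leading As Double)
        currentLeading = leading
    End Sub

    ' tx ty TD: sets the leading to -ty as a side effect.
    ' Returns the vertical offset of the move, which is ty itself.
    Public Function MoveAndSetLeading(tx As Double, ty As Double) As Double
        currentLeading = 0.0 - ty
        Return ty
    End Function

    ' T*, ' and ": vertical offset of the move to the next line.
    Public Function NextLineOffset() As Double
        Return 0.0 - currentLeading
    End Function
End Class
```

Because the leading starts at 0, a T*, ' or " that is not preceded by TL or TD yields an offset of 0 and therefore stays on the same line. That is why these operators only "most often" imply a line change.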

If you find multiple text objects (BT .. ET .. BT .. ET), the second one is not necessarily on a new line. You should look out for the special graphics state operators between them:

  • a b c d e f cm: Modify the current transformation matrix (CTM) by concatenating the specified matrix.
  • q: Save the current graphics state.
  • Q: Restore the graphics state.
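
A minimal sketch of tracking these three operators could look as follows. The class CtmTracker is hypothetical and keeps only the CTM, although q and Q really save and restore the whole graphics state, including text state parameters such as font and leading. It also assumes the CTM is the identity matrix at the start of the page content.

```vb
Imports System.Collections.Generic

' Hypothetical helper: tracks the CTM across q, Q and cm.
' A matrix is stored as {a, b, c, d, e, f}.
Public Class CtmTracker
    Private ReadOnly saved As New Stack(Of Double())
    Private ctm As Double() = {1.0, 0.0, 0.0, 1.0, 0.0, 0.0}   ' identity matrix

    ' q: save the current graphics state (here only the CTM).
    Public Sub Save()
        saved.Push(DirectCast(ctm.Clone(), Double()))
    End Sub

    ' Q: restore the most recently saved state, if there is one.
    Public Sub Restore()
        If saved.Count > 0 Then ctm = saved.Pop()
    End Sub

    ' a b c d e f cm: new CTM = given matrix x current CTM.
    Public Sub Concat(m As Double())
        ctm = New Double() {
            m(0) * ctm(0) + m(1) * ctm(2),
            m(0) * ctm(1) + m(1) * ctm(3),
            m(2) * ctm(0) + m(3) * ctm(2),
            m(2) * ctm(1) + m(3) * ctm(3),
            m(4) * ctm(0) + m(5) * ctm(2) + ctm(4),
            m(4) * ctm(1) + m(5) * ctm(3) + ctm(5)}
    End Sub

    ' Maps the point (x, y) through the CTM and returns the resulting y.
    Public Function TransformY(x As Double, y As Double) As Double
        Return x * ctm(1) + y * ctm(3) + ctm(5)
    End Function
End Class
```

Comparing TransformY for the line start of two consecutive text objects gives a better hint than ET alone: if both map to roughly the same y, the second text object most likely continues the same line.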

Your code is ignoring all numeric arguments to the operations. You should not ignore them, especially:

  • You should check the parameters of the operators listed above; e.g. while 0 -20 Td starts a new line 20 units down, 20 0 Td remains on the same line and merely starts drawing text 20 units right of the former line start.
  • You should check the numeric elements of the array parameter of TJ as they may (or may not!) indicate space between two words.
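
One way to use the numbers in a TJ array is a simple threshold heuristic, sketched below. The module name, the function name and the threshold of 200 are assumptions chosen for illustration. The numbers are expressed in thousandths of a unit of text space and are subtracted from the current position, so in horizontal writing a sufficiently negative number opens a gap to the right. Whether that gap is a word space depends on the font and the document.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Public Module TjSpacingSketch
    ' Assumed heuristic: a shift to the right of at least this many thousandths
    ' of a text space unit is treated as a word gap. Tune it per document.
    Private Const SpaceThreshold As Double = 200.0

    ' elements: the TJ array, with strings as String and numbers as Double or Integer.
    Public Function JoinTjArray(elements As IEnumerable(Of Object)) As String
        Dim sb As New StringBuilder()
        For Each element As Object In elements
            If TypeOf element Is String Then
                sb.Append(DirectCast(element, String))
            ElseIf TypeOf element Is Double OrElse TypeOf element Is Integer Then
                ' A negative adjustment moves the next glyph to the right.
                Dim adjustment As Double = Convert.ToDouble(element)
                Dim isGap As Boolean = (0.0 - adjustment) >= SpaceThreshold
                ' Avoid a leading space and doubled spaces.
                If isGap AndAlso sb.Length > 0 AndAlso sb.Chars(sb.Length - 1) <> " "c Then
                    sb.Append(" "c)
                End If
            End If
        Next
        Return sb.ToString()
    End Function
End Module
```

For example, JoinTjArray(New Object() {"Hello", -250, "World"}) returns "Hello World", while JoinTjArray(New Object() {"A", 15.0, "V"}) returns "AV" because the small positive number is only a kerning adjustment.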

Your code assumes that the Value of a CString instance already contains Unicode-encoded character data. In general this assumption is incorrect: the encoding of strings drawn by text-showing operators is determined by the current font. Thus, you should also look out for

  • font size Tf: Set the text font, Tf, to font and the text font size, Tfs, to size. font shall be the name of a font resource in the Font subdictionary of the current resource dictionary.
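
The sketch below shows only the bookkeeping side of this: remembering the font selected by Tf and routing every string through a decoder for that font. The class FontAwareDecoder is hypothetical, and the decoders themselves are assumed to be supplied by the caller, for example built from each font's Encoding or ToUnicode information. Building them is the hard part and is not shown here.

```vb
Imports System
Imports System.Collections.Generic

' Hypothetical helper: remembers the current font and decodes strings with a
' caller-supplied decoder for that font.
Public Class FontAwareDecoder
    Private ReadOnly fontDecoders As Dictionary(Of String, Func(Of String, String))
    Private currentFont As String = Nothing

    ' decodersByFontName: one decoder per font resource name (assumed input).
    Public Sub New(decodersByFontName As Dictionary(Of String, Func(Of String, String)))
        fontDecoders = decodersByFontName
    End Sub

    ' font size Tf: remember the selected font resource name.
    Public Sub SetFont(fontName As String, size As Double)
        currentFont = fontName
    End Sub

    ' Decodes a raw string with the decoder of the current font.
    ' Falls back to the raw value when no decoder is known for that font.
    Public Function Decode(raw As String) As String
        Dim decoder As Func(Of String, String) = Nothing
        If currentFont IsNot Nothing AndAlso fontDecoders.TryGetValue(currentFont, decoder) Then
            Return decoder(raw)
        End If
        Return raw
    End Function
End Class
```

In the question's code, the CString case would then append the result of Decode instead of the raw Value.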

For details, you should first and foremost study the PDF specification, ISO 32000-1, especially chapter 9, Text, with a solid background from chapter 8, Graphics.

Licensed under: CC-BY-SA with attribution