Perl PDF line-by-line parser?
14-11-2019
Question
I have a PDF that consists only of text, with no special characters, images, etc. Is there a Perl module out there (I've been looking on CPAN without success) that can help me parse each page line by line? (Converting the PDF to text gives poor results and unparsable data.)
Thanks,
Solution
When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) using the -xml output option. This produces an XML file which I parse using XML::Twig (or any other XML parser you like except XML::Simple).
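A minimal sketch of that pipeline, assuming pdftohtml is on your PATH and XML::Twig is installed from CPAN. The helper name pdf_to_twig and the output base name are made up for illustration, and I'm assuming pdftohtml writes its result to "<base>.xml" when given -xml; check the man page for your Poppler version.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: convert a PDF to XML with pdftohtml, then parse it.
sub pdf_to_twig {
    my ($pdf, $base) = @_;

    # Assumption: with -xml, pdftohtml writes its output to "$base.xml".
    # The list form of system() bypasses the shell, so odd file names are safe.
    system('pdftohtml', '-xml', $pdf, $base) == 0
        or die "pdftohtml failed with status $?";

    my $twig = XML::Twig->new;
    $twig->parsefile("$base.xml");    # dies if the XML is not well-formed
    return $twig;
}

# Only run the conversion when a PDF is named on the command line.
if (@ARGV) {
    my $twig  = pdf_to_twig($ARGV[0], 'output');
    my @pages = $twig->root->children('page');
    printf "%d page(s)\n", scalar @pages;
}
```

Run it as something like "perl extract.pl input.pdf" (both names are placeholders).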
The XML format is fairly simple. You get a <page> element for each page in the PDF, which contains <fontspec> elements describing the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple can't parse it properly).
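To make that structure concrete, here is a rough sketch of walking it with XML::Twig. The XML below is a hand-written sample shaped like the description above, not real pdftohtml output (which carries more attributes), and page_lines is just a name I picked. The point is that the text method returns the character data of a <text> element with any nested <b> or <i> tags stripped.

```perl
use strict;
use warnings;
use XML::Twig;

# Hand-written sample; attribute values are invented for illustration.
my $xml = <<'XML';
<pdf2xml>
<page number="1">
<fontspec id="0" size="12" family="Times"/>
<text top="100" left="72" font="0">Plain, <b>bold</b> and <i>italic</i> text</text>
<text top="120" left="72" font="0">Second line</text>
</page>
</pdf2xml>
XML

# Hypothetical helper: return one array ref of line strings per <page>.
sub page_lines {
    my ($xml_string) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml_string);

    my @pages;
    for my $page ($twig->root->children('page')) {
        # text() concatenates all character data, ignoring inline markup.
        push @pages, [ map { $_->text } $page->children('text') ];
    }
    return @pages;
}

my @pages = page_lines($xml);
print "$_\n" for @{ $pages[0] };
```

The lines come back in document order here, which is not always reading order.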
You do need to use the top and left attributes of the <text> tags to get them in the right order, because they aren't necessarily emitted in top-to-bottom order. The coordinate system has 0,0 in the upper left corner of the page with down and right being positive. Dimensions are in PostScript points (72 points per inch).