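Problem
I have a PDF that consists only of text, with no special characters, images, or the like. Please help me parse each page line by line. (Converting the PDF to plain text gives wrong results and data that is not properly separated.)
Thank you.
Solution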
When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) using the -xml output option. This produces an XML file, which I parse using XML::Twig (or any other XML parser you like except XML::Simple).
The XML format is fairly simple. You get a <page> element for each page in the PDF, which contains <fontspec> elements describing the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple can't parse it properly).
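To make the structure concrete, here is a small Python sketch that reads the text of each <text> element, including any nested <b> and <i> tags. It uses Python's standard xml.etree.ElementTree rather than XML::Twig. The SAMPLE string is hand-written to match the description above and is not real pdftohtml output; the root element name, the number and id attributes, and the function name page_texts are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like the format described above (an assumption,
# not captured pdftohtml output).
SAMPLE = """\
<pdf2xml>
  <page number="1">
    <fontspec id="0"/>
    <text top="100" left="72">Plain line</text>
    <text top="120" left="72">Some <b>bold</b> and <i>italic</i> text</text>
  </page>
</pdf2xml>
"""


def page_texts(xml_string):
    """Return one list of text strings per <page>, in document order."""
    root = ET.fromstring(xml_string)
    pages = []
    for page in root.iter("page"):
        # itertext() walks into nested <b>/<i> children, so text with
        # inline markup comes back as one complete string.
        pages.append(["".join(t.itertext()) for t in page.findall("text")])
    return pages


print(page_texts(SAMPLE))
```

Reading only the element's direct text would stop at the first <b> or <i>, which is the kind of mixed content that trips up simpler parsers.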
You do need to use the top and left attributes of the <text> tags to get them in the right order, because they aren't necessarily emitted in top-to-bottom order. The coordinate system has 0,0 in the upper left corner of the page, with down and right being positive. Dimensions are in PostScript points (72 points per inch).
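The ordering step might look like the following Python sketch, which sorts the <text> elements of each page by top, groups fragments whose top values are close together into one line, and orders each line by left. The tolerance parameter, the function name ordered_lines, and the hand-written SAMPLE are assumptions for illustration; a suitable tolerance depends on the document.

```python
import xml.etree.ElementTree as ET

# Hand-written sample with elements deliberately out of reading order.
SAMPLE = """\
<pdf2xml>
  <page number="1">
    <text top="120" left="72">second line</text>
    <text top="100" left="200">right half</text>
    <text top="100" left="72">first line, left half</text>
  </page>
</pdf2xml>
"""


def ordered_lines(xml_string, tolerance=2.0):
    """Return, per page, the text lines in top-to-bottom, left-to-right order."""
    root = ET.fromstring(xml_string)
    result = []
    for page in root.iter("page"):
        items = [
            (float(t.get("top")), float(t.get("left")), "".join(t.itertext()))
            for t in page.findall("text")
        ]
        items.sort(key=lambda item: item[0])  # top-to-bottom first

        groups = []  # each group holds the (left, text) fragments of one line
        line_top = None
        for top, left, text in items:
            if line_top is not None and abs(top - line_top) <= tolerance:
                groups[-1].append((left, text))
            else:
                groups.append([(left, text)])
                line_top = top

        # Within a line, order fragments left-to-right and join them.
        result.append(
            [" ".join(text for _, text in sorted(g, key=lambda f: f[0]))
             for g in groups]
        )
    return result


print(ordered_lines(SAMPLE))
```

Grouping by top before sorting by left keeps fragments that sit a fraction of a point apart vertically on the same line instead of interleaving them.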