문제

Using perl to utilize pdftotext for the purpose of extracting text from a pdf. Works great. My issue is that the pdf's I am reading are multi-page and I am looking for data on specific lines at the top each page. The following code dumps the entire contents of both pages to one file. Because the data length after the constant data (at the top of page) varies I can't accurately pull my data from page 2. How would I step through each page either using pdftotext or some other utility/module first, then call pdftotext on each page individually?

#!/usr/bin/perl
print "Content-type: text/html\n\n";

print "\n<style>
div.line {width:100%;white-space:nowrap;}
div.line div {width:80px;float:left;}
</style>";

my $i=0;
open FILE, "pdftotext -layout my_multi_page_pdf.pdf - |";

while (<FILE>) {

    $i++;
    my ($line) = $_;
    print "\n<div class=\"line\"><div>$i</div>$line</div>";
}
close FILE;
도움이 되었습니까?

해결책

use strict;
use warnings;

my $i       = 0;
my $pageNum = 1;

open my $fh, "pdftotext -layout multipage.pdf - |" or die $!;
print "---------- Begin Page $pageNum ----------\n";

while ( my $line = <$fh> ) {
    if ( $line =~ /\xC/ ) {
        print "\n---------- End Page $pageNum ----------\n";
        $pageNum++;
        print "---------- Begin Page $pageNum ----------\n";
    }

    $i++;
    print "\n<div class=\"line\"><div>$i</div>$line</div>";
}

close $fh;
라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top