I'm using Perl with pdftotext to extract text from a PDF, and it works great. My issue is that the PDFs I am reading are multi-page and I am looking for data on specific lines at the top of each page. The following code dumps the entire contents of both pages to one file. Because the length of the data after the constant data (at the top of the page) varies, I can't accurately pull my data from page 2. How would I step through each page, either using pdftotext or some other utility/module first, and then call pdftotext on each page individually?

#!/usr/bin/perl
use strict;
use warnings;

print "Content-type: text/html\n\n";

print "\n<style>
div.line {width:100%;white-space:nowrap;}
div.line div {width:80px;float:left;}
</style>";

my $i = 0;

# Pipe pdftotext's output straight into Perl
open my $fh, "pdftotext -layout my_multi_page_pdf.pdf - |"
    or die "Cannot run pdftotext: $!";

while ( my $line = <$fh> ) {
    $i++;
    print "\n<div class=\"line\"><div>$i</div>$line</div>";
}
close $fh;

Solution

use strict;
use warnings;

my $i       = 0;
my $pageNum = 1;

open my $fh, "pdftotext -layout multipage.pdf - |"
    or die "Cannot run pdftotext: $!";

print "---------- Begin Page $pageNum ----------\n";

while ( my $line = <$fh> ) {

    # pdftotext emits a form feed (\f, 0x0C) at each page break
    if ( $line =~ /\f/ ) {
        print "\n---------- End Page $pageNum ----------\n";
        $pageNum++;
        print "---------- Begin Page $pageNum ----------\n";
    }

    $i++;
    print "\n<div class=\"line\"><div>$i</div>$line</div>";
}

# Close out the final page, which has no trailing form feed
print "\n---------- End Page $pageNum ----------\n";
close $fh;
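Alternatively, if you slurp the whole pdftotext output into one string, you can split it on the form-feed character to get one element per page and then keep only the top lines of each page, where your constant header data lives. A minimal sketch, where the sample text, the `top_lines_per_page` helper, and the header-line count are made up for illustration:

```perl
use strict;
use warnings;

# Split raw pdftotext output into pages on \f and return an array of
# arrayrefs holding the first $n lines of each page.
sub top_lines_per_page {
    my ( $text, $n ) = @_;
    my @pages = split /\f/, $text;    # pdftotext separates pages with \f
    my @tops;
    for my $page (@pages) {
        my @lines = split /\n/, $page;
        my $last  = $n - 1;
        $last = $#lines if $last > $#lines;    # page shorter than $n lines
        push @tops, [ @lines[ 0 .. $last ] ];
    }
    return @tops;
}

# Stand-in for `pdftotext -layout file.pdf -` output (two pages)
my $sample = "Header A\nHeader B\nbody text...\n\fHeader C\nHeader D\nmore body...\n";

my @tops = top_lines_per_page( $sample, 2 );
print "Page $_: @{ $tops[ $_ - 1 ] }\n" for 1 .. @tops;
```

If you would rather call pdftotext on each page individually, it also accepts `-f` (first page) and `-l` (last page) options, so `pdftotext -f 2 -l 2 -layout multipage.pdf -` extracts only page 2.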
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow