Question

I want to extract all the tables from an HTML file and print their contents with each cell separated by \t, each row separated by \n, and each table separated by \n\n. The following is my script. When I changed it to call findvalues on tr, the whole tr was returned as a single element. I also tried other methods such as findnodes_as_strings($path). How can I modify the script to produce the structure described above?

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree= HTML::TreeBuilder::XPath->new;
$tree->parse_file( "html.html");

my @values=$tree->findvalues(q{//table//tr//td});

print $_, "\n" foreach(@values);

Solution

You need to process each table separately, same for rows:

foreach my $table ( $tree->findnodes('//table') ) {

    foreach my $row ( $table->findnodes('.//tr') ) {

        my @cells = $row->findvalues('.//td');
        print join("\t", @cells), "\n";
    }
    print "\n";
}

Of course, this is a solution only for simple tables (think about column spans, th header cells, tables nested inside tables, etc.).
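
As a rough sketch of how two of those cases might be handled, the variant below also picks up th cells and skips rows that belong to a nested table. It is not part of the original answer: the helper name tables_to_text and the inline sample HTML are made up for illustration, and it assumes the HTML::Element methods look_up, content_list and as_trimmed_text are available on the nodes returned by findnodes. Column spans are still not handled, and the text of a nested table will still appear inside the outer cell that contains it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use Scalar::Util qw(refaddr);

# Hypothetical helper: return all tables as text, with cells joined by \t,
# each row ended by \n, and an extra \n after each table.
sub tables_to_text {
    my ($tree) = @_;
    my $out = '';

    foreach my $table ( $tree->findnodes('//table') ) {

        # Keep only rows whose nearest enclosing table is this one, so the
        # rows of a nested table are not emitted twice.
        my @rows = grep { refaddr( $_->look_up( _tag => 'table' ) ) == refaddr($table) }
                   $table->findnodes('.//tr');

        foreach my $row (@rows) {
            # Direct td and th children of the row, in document order.
            my @cells = grep { ref $_ && $_->tag =~ /^t[dh]$/ } $row->content_list;

            # as_trimmed_text collapses whitespace, so stray tabs or newlines
            # inside a cell do not break the output format.
            $out .= join( "\t", map { $_->as_trimmed_text } @cells ) . "\n";
        }
        $out .= "\n";
    }
    return $out;
}

# Made-up sample input; for a file, call $tree->parse_file("html.html") instead.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content(
    '<table><tr><th>Name</th><th>Qty</th></tr>'
  . '<tr><td>apple</td><td> 3 </td></tr></table>'
);
print tables_to_text($tree);
$tree->delete;
```

Building the output as a string rather than printing inside the loops is only a convenience here; it makes the result easy to inspect or write to a file.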

Licensed under: CC-BY-SA with attribution