Question

The documentation on CPAN doesn't really explain this behavior unless I'm missing something. I've put together some quick test code to illustrate my problem:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;

my $testHtml = " 
<body>
        <h1>
                <p> 
                        <p>HELLO!
                        </p> 
                </p> 
        </h1>
</body>";

my $parsedPage = HTML::TreeBuilder->new;
$parsedPage->parse($testHtml);
$parsedPage->eof();

my @p = $parsedPage->look_down('_tag' => 'p');

foreach my $p (@p) {
    print $p->parent->tag, " : ", $p->tag, "\t", $p->as_text, "\n";
}

After running the above script, the output is:

body : p

body : p        HELLO! 

Since each tag is nested inside the previous one, I would expect the parent of the first p tag to be h1 and the parent of the second p tag to be the first p. Why does the parent method return the body tag for both?


Solution

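Your HTML is invalid: a p element is not allowed inside an h1, and a p cannot be nested inside another p. Given that HTML::TreeBuilder is a subclass of HTML::Parser, I can only assume that the parser is doing what it can to transform your document into valid HTML, closing the h1 and the first p early so that both p elements end up directly under body.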

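You can call $parsedPage->as_HTML to see what the parser has done to your HTML. Note that as_HTML omits optional end tags such as </p> by default, so the two p elements in the output are siblings, not nested. It gives me this: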

<html><head></head><body><h1></h1><p><p>HELLO! </body></html>
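
To make the repaired structure easier to read, here is a minimal sketch using the same markup as the question. It assumes that passing an empty hash reference as the third argument of as_HTML forces every end tag to be printed, and it uses dump and lineage_tag_names from HTML::Element to show where each p ended up. The variable names are my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Same markup as in the question.
my $html = <<'END_HTML';
<body>
  <h1>
    <p>
      <p>HELLO!
      </p>
    </p>
  </h1>
</body>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof();

# Second argument: indent string. Third argument: an empty hash ref,
# meaning "no end tag is optional", so every closing tag is printed.
print $tree->as_HTML(undef, "  ", {}), "\n";

# dump() prints an indented outline of the parsed tree to STDOUT.
$tree->dump;

# For each p element, print its tag followed by its ancestors,
# nearest ancestor first.
my @p = $tree->look_down('_tag' => 'p');
for my $p (@p) {
    print join(" < ", $p->tag, $p->lineage_tag_names), "\n";
}
```

With the end tags written out, it should be clear that h1 was closed before the first p was opened, which is why parent reports body for both p elements.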

Perhaps you should pass your HTML through a validator or HTML::Tidy before processing it.
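
If you would rather see the nesting as it was written than have it repaired, one alternative is the implicit_tags attribute of HTML::TreeBuilder. The sketch below assumes the documented behavior: when it is false, the parser does not try to deduce implicit elements or end tags and the tree just reflects the text as it stands. I have not checked the result against this exact input, and the documentation warns that such a tree is mostly useful for quick and dirty parsing, so treat this as an experiment and not as a fix.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Same nesting as in the question, condensed onto one line.
my $html = "<body><h1><p><p>HELLO!</p></p></h1></body>";

my $tree = HTML::TreeBuilder->new;

# Turn off the repair step. This has to be set before parsing starts.
$tree->implicit_tags(0);

$tree->parse($html);
$tree->eof();

# Print an outline of the resulting tree for comparison with the
# default (repaired) parse.
$tree->dump;

# Report the parent of each p element, as in the question.
my @p = $tree->look_down('_tag' => 'p');
for my $p (@p) {
    my $parent = $p->parent;
    print( ($parent ? $parent->tag : '(none)'), " : ", $p->tag, "\n" );
}
```

Comparing this dump with the default one shows exactly which elements the parser moved.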

Licensed under: CC-BY-SA with attribution