Question

I am writing a basic script that extracts all the links from a web page. It is written in Perl and uses the WWW::Mechanize and HTML::TreeBuilder::XPath modules, both of which I have installed through CPAN.

I know this can be done easily using WWW::Mechanize alone; however, I would like to learn how to do it with XPath as well.
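For context, the WWW::Mechanize-only version I mean is something along these lines (a minimal sketch using Mechanize's built-in links method, which returns WWW::Mechanize::Link objects):

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get("https://example.com");

# links() collects links from <a>, <area>, <frame> and similar tags
for my $link ($mech->links) {
    print $link->url, "\n";
}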

So, the script will parse the entire web page, check the href attribute of every anchor tag, extract the link, and print it to the console or write it to a file. Please note that in the script below I have not used use strict, since I am only writing this to clarify and understand the concept of using XPath to traverse the HTML tree.

Here is the script:

#! /usr/bin/perl

use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use warnings;

$url="https://example.com";

$mech=WWW::Mechanize->new();
$mech->get($url);

$tree=HTML::TreeBuilder::XPath->new();

$tree->parse($mech->content);

$nodes=$tree->findnodes(q{'//a'}); # line is modified later.

foreach $node($nodes)
{
    print $node->attr('href');
}

And it gives an error:

Can't locate object method "attr" via package "XML::XPathEngine::Literal" at pagegetter.pl line 23.

I have modified the script as follows:

$nodes=$tree->findnodes(q{'//a/@href'});

while($node=$nodes->shift)
{
  print $node->attr('href');
}

Error:

Can't locate object method "shift" via package "XML::XPathEngine::Literal"

I am not sure how to print the value of the href attribute.

Shouldn't $nodes hold the list of all the href attributes? I believe it stores not the values themselves but references to them?

I tried searching and reading examples; however, I am not sure how to go about it.

Thanks.


Solution

There are a couple of mistakes. q{'//a'} produces the string '//a' with the quotes included, so the XPath engine evaluates it as a string literal (hence the XML::XPathEngine::Literal in both error messages) rather than as a location path. On top of that, findnodes is called in scalar context and the result is then iterated as if it were a list. Repairs:

# call findnodes in list context so it returns the matching elements
my @nodes = $tree->findnodes(
    q{//a}       # just a string, not a string containing quotes
);

# iterate over the array
for my $node (@nodes) {
    # the nodes are HTML::Element objects, so attr works here
    print $node->attr('href'), "\n";
}
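As a side note, HTML::TreeBuilder::XPath can also return attribute values directly, which lets you skip the attr call entirely. A minimal sketch, assuming the module's findvalues method:

# select the href attributes and get their text values in one go
my @hrefs = $tree->findvalues(q{//a/@href});
print "$_\n" for @hrefs;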