Question

I am trying to use XPath to extract some HTML tags and data, and for that I need the XML::LibXML module.

I tried installing it from the CPAN shell, but the installation fails.

I followed the installation instructions on the CPAN site, which say to install the libxml2, iconv and zlib wrappers before installing XML::LibXML, but that did not work either.

Also, if there is any other simpler module that gets my task done, please let me know.

The task at hand:

I am searching for a specific <dd> tag on an HTML page that is really big (around 5000 - 10000 <dd> and <dt> tags). So I am writing a script that matches the content within a <dd> tag and fetches the content within the corresponding (next) <dt> tag.

I wish I could have been a little clearer. Any help is greatly appreciated.

Solution

If you are using ActiveState Perl, you should add the repositories listed at ActivePerl 10xx Win32 PPM packages to ppm and then run:

ppm install XML::LibXML

Trying to parse HTML as XML is generally not a pleasant task. I think HTML::TokeParser is more suitable to the task.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;

my $p = HTML::TokeParser->new(\*DATA);

my @definitions;

# Walk each <dl>, pairing every <dt> term with the <dd> that follows it.
while ( $p->get_tag('dl') ) {
    while ( $p->get_tag('dt') ) {
        my $term = $p->get_trimmed_text('/dt');   # text up to </dt>
        $p->get_tag('dd');                        # advance to the next <dd>
        my $defn = $p->get_trimmed_text('/dd');   # text up to </dd>
        push @definitions, [$term, $defn];
    }
}

use Data::Dumper;
print Dumper \@definitions;

__DATA__
<dl>
<dt>One</dt>
<dd>1</dd>
<dt>Two</dt>
<dd>2</dd>
</dl>

Output:

$VAR1 = [
          [
            'One',
            '1'
          ],
          [
            'Two',
            '2'
          ]
        ];
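For the lookup described in the question, the same parser can stop as soon as it reaches the entry of interest instead of collecting every pair. A minimal sketch, assuming the term sits in <dt> and its value in the following <dd> as in the sample data above (swap the two tag names if your page has them the other way round). find_definition is a hypothetical helper name, not part of HTML::TokeParser:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper: return the text of the first <dd> that follows
# a <dt> whose trimmed text equals $wanted, or undef if there is none.
sub find_definition {
    my ($html, $wanted) = @_;

    # A reference to a scalar is parsed as the document itself.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    while ( $p->get_tag('dt') ) {
        my $term = $p->get_trimmed_text('/dt');
        next unless $term eq $wanted;

        # Found the term; the next <dd> holds its definition.
        $p->get_tag('dd') or return;
        return $p->get_trimmed_text('/dd');
    }
    return;
}

my $html = <<'HTML';
<dl>
<dt>One</dt>
<dd>1</dd>
<dt>Two</dt>
<dd>2</dd>
</dl>
HTML

my $defn = find_definition($html, 'Two');
print defined $defn ? "$defn\n" : "not found\n";
```

Because the parser works through the page token by token and returns on the first match, this should cope with a page of several thousand entries without building a full tree.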

OTHER TIPS

If you just want XPath queries, I wrote a script yesterday that uses XML::XPath::XMLParser to run XPath queries on an XML file.

I have tested it with both ActiveState's Perl installation and Strawberry Perl on Windows.

I don't remember having to go to CPAN to install any modules (though I may have done so earlier and forgotten), so perhaps you can use the XML::XPath module instead?

Here is the sample from the documentation:

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => 'test.xhtml');

my $nodeset = $xp->find('/html/body/p'); # find all paragraphs

foreach my $node ($nodeset->get_nodelist) {
    print "FOUND\n\n", 
        XML::XPath::XMLParser::as_string($node),
        "\n\n";
}
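Applied to the <dt>/<dd> lookup from the question, the same module could be used roughly as follows. This is only a sketch: it assumes the input is well-formed XML (XHTML), since XML::XPath will not accept ordinary tag-soup HTML, and it assumes the term contains no double quote because it is pasted straight into the XPath expression. xpath_definition is a hypothetical helper name:

```perl
use strict;
use warnings;

use XML::XPath;

# Hypothetical helper: return the text of the first <dd> sibling that
# follows the <dt> whose text equals $term, or undef if none is found.
# Assumes $term contains no double quote.
sub xpath_definition {
    my ($xml, $term) = @_;

    my $xp      = XML::XPath->new( xml => $xml );
    my $nodeset = $xp->find(
        qq{//dt[. = "$term"]/following-sibling::dd[1]}
    );

    my ($dd) = $nodeset->get_nodelist;
    return defined $dd ? $dd->string_value : undef;
}

my $xml = <<'XML';
<dl>
<dt>One</dt>
<dd>1</dd>
<dt>Two</dt>
<dd>2</dd>
</dl>
XML

my $defn = xpath_definition($xml, 'Two');
print defined $defn ? "$defn\n" : "not found\n";
```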

Assuming that you are using ActiveState Perl, you can get XML::LibXML working just fine. The XML::LibXML package is available from Randy Kobes' site, and libxml, libxslt and the related libraries are available from zlatkovic.com.

I install libxml first and then use ppm to install XML::LibXML. That works just fine.

If you are using Strawberry Perl, CPAN should work for you, as libxml2 and the other libraries are part of the Strawberry Perl distribution, I believe.
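Once XML::LibXML is installed by either route, its HTML parser can be combined with XPath for the lookup in the question. A rough sketch, assuming the term is in <dt> and its value in the next <dd>, and that the term contains no double quote (it is interpolated into the XPath expression). definition_for is a hypothetical helper name, not part of the module:

```perl
use strict;
use warnings;

use XML::LibXML;

# Hypothetical helper: parse $html leniently and return the text of the
# first <dd> sibling following the <dt> whose text equals $term.
# Assumes $term contains no double quote.
sub definition_for {
    my ($html, $term) = @_;

    my $doc = XML::LibXML->load_html(
        string          => $html,
        recover         => 1,    # keep going on malformed markup
        suppress_errors => 1,    # do not print parser complaints
    );

    my ($dd) = $doc->findnodes(
        qq{//dt[normalize-space(.) = "$term"]/following-sibling::dd[1]}
    );
    return defined $dd ? $dd->textContent : undef;
}

my $html = <<'HTML';
<dl>
<dt>One</dt>
<dd>1</dd>
<dt>Two</dt>
<dd>2</dd>
</dl>
HTML

my $defn = definition_for($html, 'Two');
print defined $defn ? "$defn\n" : "not found\n";
```

Using load_html rather than the XML parser means the page does not have to be well-formed XML, which is usually the sticking point when running XPath against real-world HTML.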

Also see my post in the thread "How do I install XML::LibXML for ActivePerl?". It discusses some issues and solutions I encountered when installing XML-LibXML using PPM.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow