Perl XML::LibXML $node->findnodes($xpath) finds nodes it shouldn't

https://stackoverflow.com/questions/11955052

26-06-2021

Question

Here's some code I am having problems with. I process some XML, and in a method of an OO class I extract an element from each of several nodes that repeat in the document. There should be only one such element in the subtree of each node, but my code gets all matching elements, as if it were operating on the document as a whole.

Because I expected to get only one element, I use only the zeroth element of the array. This leads my function to output the wrong value (and it's the same for every item in the document).

Here's some simplified code that illustrates the problem:

$ cat t4.pl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<EndXML;
<Envelope>
  <Body>
    <Reply>
      <List>
        <Item>
          <Id>8b9a</Id>
          <Message>
            <Response>
              <Identifier>55D</Identifier>
            </Response>
          </Message>
        </Item>
        <Item>
          <Id>5350</Id>
          <Message>
            <Response>
              <Identifier>56D</Identifier>
            </Response>
          </Message>
        </Item>
      </List>
    </Reply>
  </Body>
</Envelope>
EndXML

my $foo = Foo->new();

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_string( $xml );
my @list   = $doc->getElementsByTagName( 'Item' );

for my $item ( @list ) {

    my $id = get( $item, 'Id' );
    my @messages = $item->getElementsByLocalName( 'Message' );

    for my $message ( @messages ) {

        my @children = $message->getChildNodes();

        for my $child ( @children ) {

            my $name = $child->nodeName;

            if ( $name eq 'Response' ) {
                print "child is a Response\n";
                $foo->do( $child, $id );
            }
            elsif ( $name eq 'text' ) {

                # ignore whitespace between elements
            }
            else {
                print "child name is '$name'\n";
            }
        }    # child
    }    # Message
}    # Item

# ..............................................

sub get {
    my ( $node, $name ) = @_;

    my $value   = "(Element $name not found)";
    my @targets = $node->getElementsByTagName( $name );

    if ( @targets ) {
        my $target = $targets[0];
        $value = $target->textContent;
    }

    return $value;
}

# ..............................................

package Foo;

sub new {
    my $self = {};
    bless $self;
    return $self;
}

sub do {
    my $self = shift;
    my ( $node, $id ) = @_;

    print '-' x 70, "\n", ' ' x 12, $node->toString( 1 ), "\n", '-' x 70, "\n";

    my @identifiers = $node->findnodes( '//Identifier' );
    print "do() found ", scalar @identifiers, " Identifiers\n";

    print "$id, ", $identifiers[0]->textContent, "\n\n";
}

Here's the output

$ perl t4.pl
child is a Response
----------------------------------------------------------------------
            <Response>
              <Identifier>55D</Identifier>
            </Response>
----------------------------------------------------------------------
do() found 2 Identifiers
8b9a, 55D

child is a Response
----------------------------------------------------------------------
            <Response>
              <Identifier>56D</Identifier>
            </Response>
----------------------------------------------------------------------
do() found 2 Identifiers
5350, 55D

I was expecting

do() found 1 Identifiers

I was expecting the last line to be

5350, 56D

I am using an old version of XML::LibXML due to platform issues.

Q: Does the problem exist in later versions or am I doing something wrong?

Solution

From the XPath 1.0 specification:

//para selects all the para descendants of the document root

(the emphasis, on "document root", is my own). So your call

$node->findnodes( '//Identifier' )

is ignoring the context node $node and searching for all Identifier elements anywhere in the document.

To get all Identifier descendants of the context node you must add a dot, like this:

$node->findnodes('.//Identifier');

But since $node is always a Response element, and Identifier is a direct child of Response, you can just write

$node->findnodes('Identifier');
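
The difference between the three expressions can be checked with a small script. This is only a sketch: the XML is a trimmed-down stand-in for the document in the question, and the %count and %first hashes are illustrative names, not part of the original code.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed-down stand-in for the document in the question (an assumption).
my $xml = <<'EndXML';
<List>
  <Item><Message><Response><Identifier>55D</Identifier></Response></Message></Item>
  <Item><Message><Response><Identifier>56D</Identifier></Response></Message></Item>
</List>
EndXML

my $doc = XML::LibXML->new->parse_string($xml);

# Use the second Response element as the context node.
my ( undef, $response ) = $doc->findnodes('//Response');

my ( %count, %first );
for my $xpath ( '//Identifier', './/Identifier', 'Identifier' ) {
    my @found = $response->findnodes($xpath);

    $count{$xpath} = scalar @found;
    $first{$xpath} = $found[0]->textContent;

    printf "%-14s found %d, first is %s\n", $xpath, $count{$xpath}, $first{$xpath};
}
```

With the second Response as the context node, '//Identifier' reports two nodes and returns 55D first, while './/Identifier' and 'Identifier' each report one node, 56D.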

You seem to have got yourself a little tied up writing this. I know you have cut the code down as an example, but do you really need the separate package? Much can be done with judicious application of XPath.

The most obvious change is that you don't need to loop through all children - you can simply pick out the ones you're interested in.

This refactored code may be worth reading

use strict;
use warnings;

use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_fh(*DATA);

for my $item ( $doc->findnodes('//Item') ) {

    print "\n";

    # Look the Id up once and reuse it
    my $id = $item->findvalue('Id');
    printf "Item Id: %s\n", $id;

    my @messages = $item->findnodes('Message');

    for my $message (@messages) {
        my ($response) = $message->findnodes('Response');
        printf "Response Identifier: %s\n", $response->findvalue('Identifier');
    }
}

__DATA__
<Envelope>
  <Body>
    <Reply>
      <List>
        <Item>
          <Id>8b9a</Id>
          <Message>
            <Response>
              <Identifier>55D</Identifier>
            </Response>
          </Message>
        </Item>
        <Item>
          <Id>5350</Id>
          <Message>
            <Response>
              <Identifier>56D</Identifier>
            </Response>
          </Message>
        </Item>
      </List>
    </Reply>
  </Body>
</Envelope>

output

Item Id: 8b9a
Response Identifier: 55D

Item Id: 5350
Response Identifier: 56D
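
If each Item is known to hold exactly one Message/Response/Identifier, the inner loop can be replaced by a single relative path. The following is a sketch under that assumption. The %identifier_for hash is an illustrative name, and the XML is a shortened copy of the data above. Note that findvalue concatenates the text of every matching node, so this shortcut is not suitable when an Item may contain several Identifiers.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened copy of the data used above (an assumption for this sketch).
my $xml = <<'EndXML';
<List>
  <Item>
    <Id>8b9a</Id>
    <Message><Response><Identifier>55D</Identifier></Response></Message>
  </Item>
  <Item>
    <Id>5350</Id>
    <Message><Response><Identifier>56D</Identifier></Response></Message>
  </Item>
</List>
EndXML

my $doc = XML::LibXML->new->parse_string($xml);

my %identifier_for;
for my $item ( $doc->findnodes('//Item') ) {

    # Both paths are relative, so they are evaluated against $item only.
    my $id         = $item->findvalue('Id');
    my $identifier = $item->findvalue('Message/Response/Identifier');

    $identifier_for{$id} = $identifier;
    print "$id, $identifier\n";
}
```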

OTHER TIPS

I have no comment on the quality of the code, but having learned to use XML::DOM before I used XML::LibXML, I have a tendency to use some of the DOM syntax. I have been trying to beat this habit out of me :).
The reason I mention this is because I see you have used the equivalent of ->item(0) to get the first position from a nodelist as you would in DOM.
XML::LibXML supports ->item(), but from the CPAN documentation I can see that XPath numbers the nodes in a node list from 1, not from 0 as DOM does. Note that this applies to the methods of a NodeList object; an ordinary Perl array filled by findnodes in list context still starts at index 0, so $identifiers[0] really is the first match, and switching to index 1 would not fix the output.
What is not clear is why ->item(0) gives you the last result, as it seems to from my testing. Is it perhaps offset from an array index, so that you are in fact returned the -1th array element?
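
The indexing question can be checked directly. The sketch below uses a two-element document made up for the purpose and compares three ways of asking for the first match: a Perl array from findnodes in list context, the get_node method of the XML::LibXML::NodeList object that findnodes returns in scalar context (documented as counting from 1), and a position predicate inside the XPath expression. The variable names are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up two-element document for the indexing check.
my $doc = XML::LibXML->new->parse_string(
    '<List><Identifier>55D</Identifier><Identifier>56D</Identifier></List>'
);

# List context: an ordinary Perl array, indexed from 0.
my @as_array    = $doc->findnodes('//Identifier');
my $array_first = $as_array[0]->textContent;

# Scalar context: an XML::LibXML::NodeList object; get_node counts from 1.
my $as_nodelist    = $doc->findnodes('//Identifier');
my $nodelist_first = $as_nodelist->get_node(1)->textContent;

# Inside the XPath expression, position predicates also count from 1.
my $xpath_first = $doc->findvalue('(//Identifier)[1]');

print "array index 0      : $array_first\n";
print "get_node(1)        : $nodelist_first\n";
print "(//Identifier)[1]  : $xpath_first\n";
```

All three lines print 55D, so the zeroth element of a Perl array and position 1 of a node list refer to the same node.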

Licensed under: CC-BY-SA with attribution
