Pergunta

I want to parse a simple XML document to hash using the XML::DOM module.

<?xml version ="1.0"?>
<Select>
  <book>
    <prop Name = "prop1" Title = "title1" />
    <prop Name = "prop2" Title = "title2" />
  </book>
  <fruit>
    <prop Name = "prop3" Title = "title3" />
    <prop Name = "prop4" Title = "title4" />
  </fruit>
</Select>

and the expected output is-

$VAR1 = {
  Select => {
    book  => {
               prop => [
                 { Name => "prop1", Title => "title1" },
                 { Name => "prop2", Title => "title2" },
               ],
             },
    fruit => {
               prop => [
                 { Name => "prop3", Title => "title3" },
                 { Name => "prop4", Title => "title4" },
               ],
             },
  },
}

I have written the code which is:

use strict;
use XML::DOM;
use Data::Dumper;

my @stack;
my %hash;
push @stack,\%hash;

my $parser = new XML::DOM::Parser;
my $doc = $parser -> parsefile('demo.xml');
my $root = $doc->getDocumentElement();
my $rootnode = $root->getTagName;

################################################################

foreach my $node ($doc->getElementsByTagName($rootnode)){
    push @stack,$stack[$#stack]->{$rootnode};
    my @childnode = $node->getChildNodes();

    foreach my $child(@childnode){
        if($child->isElementNode){
            my $childname = $child->getNodeName();
            pop(@stack);
            push @stack,$stack[$#stack]->{$rootnode} = {$childname,{}};
            my @childnodes2 = $child->getChildNodes();

            foreach my $subchild(@childnodes2){
                if($subchild->isElementNode){
                    my $subchildname = $subchild->getNodeName();

                    my $name = $subchild->getAttributes->getNamedItem('Name')->getNodeValue;
                    my $title = $subchild->getAttributes->getNamedItem('Title')->getNodeValue;
                    pop(@stack);
                    push @stack,$stack[$#stack]->{$rootnode}->{$child->getNodeName()} = {$subchildname,{}};    #{} contains $name or $title
                }
            }
        }
    }
}

print Dumper(\%hash);

I think, I am not able to correctly push and pop in array. Also, I don't want to use XML::Simple and recursion.

How can I do this in Perl?

Foi útil?

Solução

Here is a possible solution, assuming that the whole document follows a strict schema with one Select as root, any child nodes of different names (collisions won't be handled), and any numbers of props for these child nodes, of which the Name and Title field are interesting alone.

This is the preamble, I also used Carp for better error handling.

#!/usr/bin/perl

use strict; use warnings; use 5.012;
use XML::DOM;
use Data::Dumper;
use Carp;

Here is the main code. It launches a parser (assuming the documenent is in the special DATA filehandle), and passes the resulting document off the make_data_structure subroutine. I frequently consider to let the script die, to catch errors as early as possible.

{
    my $xml_parser = XML::DOM::Parser->new;
    my $document_string = do{ local $/=undef; <DATA> };
    my $document = $xml_parser->parse($document_string) or die;

    my $data_structure = make_data_structure($document) or die;
    print Dumper $data_structure;
}

This is the subroutine that does all the work. It takes a document and returns a hashref adhering to your format.

sub make_data_structure {
    my ($document) = @_;
    my $root = $document->getDocumentElement;
    my $rootname = $root->getTagName // "undef";

    didnt_expect_anything(but=> "Select", as=> "the root tag", got=> $rootname)
        unless $rootname eq "Select";

    my $dsc = +{ $rootname => +{} };
    CHILD:
    for my $child ($root->getChildNodes) {
        next CHILD unless $child->isElementNode;

        my $childname = $child->getTagName
            // couldnt_get("the tag name", of=> "a $rootname child");

        $dsc->{$rootname}{$childname} = undef; # unneccessary iff we have props
        PROP:
        for my $prop ($child->getChildNodes) {
            next PROP unless $prop->isElementNode;

            my $propname = $prop->getTagName // "undef";

            die didnt_expect_anything(but=> "prop", got=> $propname)
                unless $propname eq "prop";

            my $attributes = $prop->getAttributes
                // couldnt_get("the attributes", of=> "a prop node");

            # for minimum code duplication, and maximum error handling,
            # use dataflow programming, and `map`. 
            my ($Name, $Title) =
                map { $_->getNodeValue // couldnt_get("the node value", of=>"the attribute") }
                map { $attributes->getNamedItem($_) // couldnt_get("the named item $_", of=> "the prop attributes") }
                    qw/Name Title/;
            my $propvalue = +{
                Name    => $Name,
                Title   => $Title,
            };

            push @{ $dsc->{$rootname}{$childname}{$propname} }, $propvalue;
        }
    }
    return $dsc;
}

The following are custom error handling subroutines to make the above code more expressive.

sub didnt_expect_anything {
    my %args = @_;
    my $expected = $args{but} // croak qq(required named argument "but" missing);
    my $role     = $args{as}  // "a tag name";
    my $instead  = $args{got} // croak qq(required named argument "got" missing);
    croak qq(Didn't expect anything but "$expected" as $role here, got "$instead");
}
sub couldnt_get {
    my ($what, %args) = @_;
    my $of_what = $args{of} // croak qq(required named argument "of" missing);
    croak qq(Couldn't get $what of $of_what);
}

Of course, the correct output is produced, but this is not the correct way of getting there — CPAN was made to be used.

A part of the problem with your implementatin is (aside from the missing error handling), that you do some convoluted gymnastics with your "stack".

Before the first iteration of your outer loop, the @stack is +{} (a reference to an empty hash).

The line $stack[$#stack]->{$rootnode} accesses the last element of the stack (better written as $stack[-1]), treats the value as a hash ref, and looks up the entry named $rootnode. This evaluates to undef. You then push this value onto the stack. Chaos ensues.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top