Question

I have a project that requires parsing complex XML data. I decided to go with XML::Twig, and it works really well for the most part. I came across an issue where different pieces of information have the same tag name but sit at different paths. In the sample below, DateOfBirth is used for two different fields.

  <doc:DForm xmlns:doc="urn:xml-gov-au:...">
    <doc:PersonsDetails>
       <doc:GivenName LanguageIdentifier="" LanguageLocaleIdentifier="">
          John
       </doc:GivenName>
       <doc:Surname LanguageIdentifier="" LanguageLocaleIdentifier="">
          Citizen
       </doc:Surname>
       <doc:DateOfBirth LanguageIdentifier="" LanguageLocaleIdentifier="">
          2012-06-14
       </doc:DateOfBirth>
    </doc:PersonsDetails>
    <doc:SupportingInformation>
       <doc:NumberOfSiblings>
       5.00
       </doc:NumberOfSiblings>
       <doc:SiblingsDetails>
         <doc:DateOfBirth LanguageIdentifier="" LanguageLocaleIdentifier="">
         2009-03-18
         </doc:DateOfBirth>
         <doc:Name LanguageIdentifier="" LanguageLocaleIdentifier="">
         James Citizen</doc:Name>
       </doc:SiblingsDetails>
       <doc:SiblingsDetails>
         <doc:DateOfBirth LanguageIdentifier="" LanguageLocaleIdentifier="">
            2006-08-17
         </doc:DateOfBirth>
         <doc:Name LanguageIdentifier="" LanguageLocaleIdentifier="">
            Jane Citizen
         </doc:Name>
       </doc:SiblingsDetails>
       <doc:Address>
           <doc:Street>25 test street</doc:Street>
           <doc:City>Melbourne </doc:City>
           <doc:PostalCode>3000</doc:PostalCode>
       </doc:Address>
    </doc:SupportingInformation>
    </doc:MCCPDForm>

I have set up several handlers to deal with different information, but as we did not need the sibling details, they were being processed at the end, based on a two-level hash that maps the fields to XML elements.

Sample:

my %fields = (
    "DetDateOfBirth" => {
        "type"    => "Date",
        "value"   => undef,
        "dbfield" => "DetDateOfBirth",
    },
);

So, when the sibling's DOB was processed, it would fill in the above hash element, but when the person's DOB was processed, since there was already a value, the code would just move on to the next element.

So I set up another handler and made sure that information is processed first.

Now, the question: imagine there were multiple cases where the same name is used for more than one element but at different paths. Do I just write more handlers, or is there another way that manages this kind of situation better?

The relevant code:

my $namespace = "doc";
my $formname = "DForm";
my $twig = XML::Twig->new(
    pretty_print  => 'indented',
    twig_handlers => {
        "$namespace:${formname}/$namespace:PersonsDetails/$namespace:Address" =>
          \&ProcessAddress,
        "$namespace:${formname}/$namespace:SupportingInformation" =>
          \&ProcessSupportingInformation,
        "bie1:PdfFile"           => \&DecodePDF,
        "$namespace:${formname}" => \&ProcessRecord,
    }
);


sub ProcessRecord {
    my $twg    = shift;
    my $record = shift;
    my $fld;
    my $value;
    my $irn;

    my $elt = $record;

    while ( $elt = $elt->next_elt($record) ) {
        $fld = $elt->tag();

        $fld =~ s/^$namespace\://;


        if ( defined $fields{$fld}{"type"} && $elt->text ) {
            if ( $fld =~ /NameOfPlaceInstitution|HospitalNameOfBirth/i ) {
                next if $elt->text =~ /Other location/i;
            }

            if ( !defined $fields{$fld}{"value"} ) {
                $fields{$fld}{"value"} = $elt->text;
            }

        }
    }
}

sub ProcessSupportingInformation {
    my $twg    = shift;
    my $record = shift;
    my $fld;
    my $value;
    my $parent;

    my $elt = $record;

    while ( $elt = $elt->next_elt($record) ) {
        $fld = $elt->tag();
        $fld =~ s/^$namespace\://;

        $parent = $elt->parent();

        next if ( $fld =~ /PCDATA/ );

        if ( defined $fields{$fld}{"type"} && $elt->text ) {
            if ( $fld =~ /PlaceOfDeathHospital/i ) {
                if ( $elt->text =~ /Other location/i ) {
                    next;
                }
            }

            if ( $fld =~ /StreetAddress/i ) {
                $fields{"StreetAddressOfPerson"} = $elt->text;
            }
            else {
                if ( !defined $fields{$fld}{"value"} ) {
                    $fields{$fld}{"value"} = $elt->text;
                }
            }
        }
        else {
            $record->delete;
        }
    }

}

Just an FYI: the actual XML files are about 700 lines each, including an encoded PDF.

Another option would be to set another flag in the hash that maps the tags to db fields and set it when information is processed the first time.

Thanks

PS: Sorry for too many edits. I think I got it right now.

PPS: There is sensitive information in the code as well as in the XML that I can't show, so I had to edit parts of it.


Solution

It is difficult to understand your exact situation as you have cut down the problem to the point where the XML is invalid (it starts with <doc:DForm> but ends with <doc:MCCPDForm>) and the Perl code doesn't correspond to the XML data.

However, I think you are misusing XML::Twig. The "twigs" are meant primarily to reduce the XML file to a series of records that can be processed independently, not to serve as the basis for accessing individual elements inside the data.

You don't say how the <bie1:PdfFile> elements are related to the <PersonsDetails> so I can't comment on those, but it looks like there is no single element that contains the <PersonsDetails> and the related <SupportingInformation>, so they can be tied together only by their adjacency in the file.

If that is the case then I would put a handler on only those two elements, and the code would look something like the program below.

It is easy to distinguish the meaning of all the <DateOfBirth> elements because they are encountered in specific contexts: either within ProcessPersonsDetails, or within ProcessSupportingInformation as one of a list of siblings.

The program just prints the information available in your sample XML. It shouldn't be too hard to build a database record instead and write it once the last data for a given person has been processed.

Note also the call to purge, which is necessary to remove the processed information from memory. Without it there is no benefit in dealing with one twig of data at a time instead of with the entire document.

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        'doc:PersonsDetails' => \&ProcessPersonsDetails,
        'doc:SupportingInformation' => \&ProcessSupportingInformation
    }
);

$twig->parsefile('DForm.xml');


sub ProcessPersonsDetails {
    my ($twig, $record) = @_;
    print "PersonsDetails\n";
    for (qw/ doc:GivenName doc:Surname doc:DateOfBirth /) {
      print '  ', $record->first_child_trimmed_text($_), "\n";
    }
}

sub ProcessSupportingInformation {
    my ($twig, $record) = @_;
    print "SupportingInformation\n";
    for my $sibling ($record->children('doc:SiblingsDetails')) {
        print "  Sibling\n";
        for (qw/ doc:DateOfBirth doc:Name /) {
          print '    ', $sibling->first_child_trimmed_text($_), "\n";
        }
    }
    $twig->purge;
}

output

PersonsDetails
  John
  Citizen
  2012-06-14
SupportingInformation
  Sibling
    2009-03-18
    James Citizen
  Sibling
    2006-08-17
    Jane Citizen
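
As a rough sketch of the "build a record instead of printing" idea mentioned above: the same two handlers can copy the values into a hash. The %person layout, the Siblings key and the inlined cut-down XML are assumptions made for illustration only; they are not taken from the original project.

```perl
use strict;
use warnings;

use XML::Twig;

# Cut-down, well-formed version of the sample document, inlined so the
# sketch is self-contained (the real files are parsed with parsefile).
my $xml = <<'XML';
<doc:DForm xmlns:doc="urn:xml-gov-au:...">
  <doc:PersonsDetails>
    <doc:GivenName>John</doc:GivenName>
    <doc:Surname>Citizen</doc:Surname>
    <doc:DateOfBirth>2012-06-14</doc:DateOfBirth>
  </doc:PersonsDetails>
  <doc:SupportingInformation>
    <doc:SiblingsDetails>
      <doc:DateOfBirth>2009-03-18</doc:DateOfBirth>
      <doc:Name>James Citizen</doc:Name>
    </doc:SiblingsDetails>
    <doc:SiblingsDetails>
      <doc:DateOfBirth>2006-08-17</doc:DateOfBirth>
      <doc:Name>Jane Citizen</doc:Name>
    </doc:SiblingsDetails>
  </doc:SupportingInformation>
</doc:DForm>
XML

# Hypothetical record layout: one hash per person, siblings as a list.
my %person = ( Siblings => [] );

my $twig = XML::Twig->new(
    twig_handlers => {
        'doc:PersonsDetails' => sub {
            my ( $t, $record ) = @_;
            # Inside this handler, DateOfBirth can only be the person's own.
            $person{$_} = $record->first_child_trimmed_text("doc:$_")
                for qw/ GivenName Surname DateOfBirth /;
        },
        'doc:SupportingInformation' => sub {
            my ( $t, $record ) = @_;
            for my $sibling ( $record->children('doc:SiblingsDetails') ) {
                my %sib;
                $sib{$_} = $sibling->first_child_trimmed_text("doc:$_")
                    for qw/ Name DateOfBirth /;
                push @{ $person{Siblings} }, \%sib;
            }
            # Everything needed has been copied, so free the parsed twig.
            $t->purge;
        },
    },
);

$twig->parse($xml);

printf "%s %s (born %s) has %d sibling(s)\n",
    @person{qw/ GivenName Surname DateOfBirth /},
    scalar @{ $person{Siblings} };
```

At the point where the printf runs, %person holds everything for one person, so that is where a database insert could go.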

Update

If there is only a single record per file then the ability of XML::Twig to process XML data incrementally is unnecessary and the whole document can be loaded at once and processed.

This program does exactly that, and produces identical output to the previous code. Without handlers that have to be called during the parsing process, the code is significantly more concise.

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new(discard_all_spaces => 1);
my $root = $twig->parsefile('DForm.xml')->root;

print "PersonsDetails\n";
my $details = $root->first_child('doc:PersonsDetails');
for (qw/ GivenName  Surname  DateOfBirth /) {
  my $value = $details->trimmed_field("doc:$_");
  print "  $value\n";
}

print "SupportingInformation\n";
my @siblings = $root->first_child('doc:SupportingInformation')->children('doc:SiblingsDetails');
for my $sib (@siblings) {
  print "  Sibling\n";
  for (qw/ DateOfBirth  Name /) {
    my $value = $sib->trimmed_field("doc:$_");
    print "    $value\n";
  }
}

OTHER TIPS

It's a bit difficult to answer your question without seeing any code, but have you looked at triggering the handler on a longer path, on doc:PersonsDetails/doc:DateOfBirth for example? This would ensure that the date is processed only in the right context.
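
A minimal sketch of that suggestion, assuming a cut-down version of the sample XML from the question: the %dob hash and the inlined document are made up for illustration. Each handler key is a partial path, so it only fires for a DateOfBirth directly under the named parent.

```perl
use strict;
use warnings;

use XML::Twig;

# Cut-down, well-formed version of the sample document.
my $xml = <<'XML';
<doc:DForm xmlns:doc="urn:xml-gov-au:...">
  <doc:PersonsDetails>
    <doc:DateOfBirth>2012-06-14</doc:DateOfBirth>
  </doc:PersonsDetails>
  <doc:SupportingInformation>
    <doc:SiblingsDetails>
      <doc:DateOfBirth>2009-03-18</doc:DateOfBirth>
    </doc:SiblingsDetails>
    <doc:SiblingsDetails>
      <doc:DateOfBirth>2006-08-17</doc:DateOfBirth>
    </doc:SiblingsDetails>
  </doc:SupportingInformation>
</doc:DForm>
XML

# Hypothetical storage: the person's date and the siblings' dates kept apart.
my %dob = ( person => undef, siblings => [] );

my $twig = XML::Twig->new(
    twig_handlers => {
        # Fires only for a DateOfBirth whose parent is PersonsDetails.
        'doc:PersonsDetails/doc:DateOfBirth' => sub {
            my ( $t, $elt ) = @_;
            $dob{person} = $elt->trimmed_text;
        },
        # Fires only for a DateOfBirth whose parent is SiblingsDetails.
        'doc:SiblingsDetails/doc:DateOfBirth' => sub {
            my ( $t, $elt ) = @_;
            push @{ $dob{siblings} }, $elt->trimmed_text;
        },
    },
);

$twig->parse($xml);

print "Person:   $dob{person}\n";
print "Siblings: @{ $dob{siblings} }\n";
```

With this approach the same tag name never lands in the same slot twice, so there is no need for an "already has a value" check.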

Licensed under: CC-BY-SA with attribution