perl XML::Simple for repeated elements

https://stackoverflow.com/questions/19110786

29-06-2022
|

题

I have the following xml code

<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: Aug 26, 2013 10:02:03 +0900 (GMT+09:00) -->
<pathway name="path:ko01200" >
    <reaction id="14" name="rn:R01845" type="irreversible">
        <substrate id="108" name="cpd:C00447"/>
        <product id="109" name="cpd:C05382"/>
     </reaction>
    <reaction id="15" name="rn:R01641" type="reversible">
        <substrate id="109" name="cpd:C05382"/>
        <substrate id="104" name="cpd:C00118"/>
        <product id="110" name="cpd:C00117"/>
        <product id="112" name="cpd:C00231"/>
     </reaction>
</pathway>

I am trying to print the substrate id and product id with following code which I am stuck for the one that have more than one ID. Tried to use dumper to see the data structure but I don't know how to proceed. I have already used XML simple for the rest of my parsing script (this part is a small part of my whole script ) and I can not change that now

use strict;
use warnings;
use XML::Simple;
use Data::Dumper;
my $xml=new XML::Simple;
my $data=$xml->XMLin("test.xml",KeyAttr => ['id']);
print Dumper($data);
    foreach my $reaction ( sort  keys %{$data->{reaction}} ) {
        print $data->{reaction}->{$reaction}->{substrate}->{id}."\n"; 
        print $data->{reaction}->{$reaction}->{product}->{id}."\n";  

}

Here is the output

$VAR1 = {
      'name' => 'path:ko01200',
      'reaction' => {
                    '15' => {
                            'substrate' => {
                                           '104' => {
                                                    'name' => 'cpd:C00118'
                                                  },
                                           '109' => {
                                                    'name' => 'cpd:C05382'
                                                  }
                                         },
                            'name' => 'rn:R01641',
                            'type' => 'reversible',
                            'product' => {
                                         '112' => {
                                                  'name' => 'cpd:C00231'
                                                },
                                         '110' => {
                                                  'name' => 'cpd:C00117'
                                                }
                                       }
                          },
                    '14' => {
                            'substrate' => {
                                           'name' => 'cpd:C00447',
                                           'id' => '108'
                                         },
                            'name' => 'rn:R01845',
                            'type' => 'irreversible',
                            'product' => {
                                         'name' => 'cpd:C05382',
                                         'id' => '109'
                                       }
                          }
                  }
    };
 108
109
Use of uninitialized value in concatenation (.) or string at  line 12.
Use of uninitialized value in concatenation (.) or string at line 13.

解决方案

First of all, don't use XML::Simple. it is hard to predict what exact data structure it will produce from a bit of XML, and it's own documentation mentions it is deprecated.

Anyway, your problem is that you want to access an id field in the product and substrate subhashes – but they don't exist in one of the reaction subhashes

'15' => {
    'substrate' => {
         '104' => {
             'name' => 'cpd:C00118'
         },
         '109' => {
             'name' => 'cpd:C05382'
         }
     },
     'name' => 'rn:R01641',
     'type' => 'reversible',
     'product' => {
         '112' => {
             'name' => 'cpd:C00231'
         },
         '110' => {
             'name' => 'cpd:C00117'
         }
     }
 },

Instead, the keys are numbers, and each value is a hash containing a name. The other reaction has a totally different structure, so special-case code would have been written for both. This is why XML::Simple shouldn't be used – the output is just to unpredictable.

Enter XML::LibXML. It is not extraordinary, but it implememts standard APIs like the DOM and XPath to traverse your XML document.

use XML::LibXML;
use feature 'say'; # assuming perl 5.010

my $doc = XML::LibXML->load_xml(file => "test.xml") or die;

for my $reaction_item ($doc->findnodes('//reaction/product | //reaction/substrate')) {
  say $reaction_item->getAttribute('id');
}

Output:

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow