Question

I'm trying to extract values from XML with Nokogiri.

I want to collect, in separate sub-arrays, the values of child elements that share a name but sit under different XPaths. Those elements are ProdA and ProdB.

Currently I'm only trying to print the child elements, but the code I have so far prints only "SDocument" and not the child elements.

The goal is to have an array like this:

array = [["2","8"], ["8","9"]]

This is the code:

#!/usr/bin/env ruby
require 'nokogiri'

doc = Nokogiri::XML(File.open("input.xml"))

a = doc.xpath("//SDocument").each do |n|
  n if n.text?
end

puts a

This is the XML:

<?xml version="1.0" encoding="UTF-8"?>
<Document-St-5>
  <SDocument>
    <ItemList>
      <Items_A>
        <ItemElem>
          <Item_Values>
            <ProdA>2</ProdA>
            <ProdB>8</ProdB>
          </Item_Values>
        </ItemElem>        
      </Items_A>
      <Items_B>
        <ItemElem>
          <Item_Values>
            <ProdA>8</ProdA>
            <ProdB>9</ProdB>
          </Item_Values>
        </ItemElem>
      </Items_B>
    </ItemList>
  </SDocument>
</Document-St-5>

Can somebody point me to the correct way please?


Update:

What I actually want is to store, in an array, the XPath of every unique child element of the SDocument node, and to group the values of those that occur more than once. If possible, I'd like to get the unique XPaths without knowing the names of the children in advance.

For example:

The child elements StName and StCode occur only once each, so the array holding the XPaths so far would be:

arr_Xpath = [ ["/Document-St-5/SDocument/StName"], ["/Document-St-5/SDocument/StCode"], ... ]

The ProdA nodes under the Items_A node have the following XPath:

/Document-St-5/SDocument/ItemList/Items_A/ItemElem/Item_Values/ProdA

The ProdA nodes under the Items_B node have the following XPath:

/Document-St-5/SDocument/ItemList/Items_B/ItemElem/Item_Values/ProdA

Then the array of unique XPaths of the child elements (including the ProdB nodes' XPaths) would be:

arr_Xpath = [ "/Document-St-5/SDocument/StName",
              "/Document-St-5/SDocument/StCode",
              "/Document-St-5/SDocument/ItemList/Items_A/ItemElem/Item_Values/ProdA",
              "/Document-St-5/SDocument/ItemList/Items_A/ItemElem/Item_Values/ProdB",
              "/Document-St-5/SDocument/ItemList/Items_B/ItemElem/Item_Values/ProdA",
              "/Document-St-5/SDocument/ItemList/Items_B/ItemElem/Item_Values/ProdB" ]

I think that, once the unique XPaths are known, it would be possible to use doc.xpath("..") to get the values of each child element and group them when an element occurs more than once. So, the final array I'd like to get is:

arr_Values = [ ["WERLJ01"], ["MEKLD"],["2","9"],["8","3"],["1"],["17"]]

Where:

  • arr_Values[0] is the array that contains StName values
  • arr_Values[1] is the array that contains StCode values
  • arr_Values[2] is the array that contains the values of all the ProdA nodes under Items_A.
  • arr_Values[3] is the array that contains the values of all the ProdB nodes under Items_A.
  • arr_Values[4] is the array that contains the values of all the ProdA nodes under Items_B.
  • arr_Values[5] is the array that contains the values of all the ProdB nodes under Items_B.

An XML example is:

<?xml version="1.0" encoding="UTF-8"?>
<Document-St-5>
  <SDocument>
    <StName>WERLJ01</StName>
    <StCode>MEKLD</StCode>
  <ItemList>
    <Items_A>
      <ItemElem>
        <Item_Values>
          <ProdA>2</ProdA>
          <ProdB>8</ProdB>
        </Item_Values>
      </ItemElem>        
    </Items_A>
    <Items_A>
      <ItemElem>
        <Item_Values>
          <ProdA>9</ProdA>
          <ProdB>3</ProdB>
        </Item_Values>
      </ItemElem>        
    </Items_A>       
    <Items_B>
      <ItemElem>
        <Item_Values>
          <ProdA>1</ProdA>
          <ProdB>17</ProdB>
        </Item_Values>
      </ItemElem>
    </Items_B>
  </ItemList>
  </SDocument>
</Document-St-5>  

Update 2:

Hello the Tin Man, it works! What do "%w" and "%w[element1 element2]" mean? Does the %w[...] form accept more than two elements?

I'm a newbie to Nokogiri. I only mentioned XPath because the XML has more than 200 unique child nodes (unique XPaths). Do you suggest using the same CSS technique for all child nodes, or is there a way to process the XML and do the same thing (group into arrays the elements that have the same name and the same XPath) without knowing the names of the child nodes? I'd like to know which approach you suggest.

Thanks again

No correct solution

OTHER TIPS

Here's one way:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<Document-St-5>
  <SDocument>
    <ItemList>
      <Items_A>
        <ItemElem>
          <Item_Values>
            <ProdA>2</ProdA>
            <ProdB>8</ProdB>
          </Item_Values>
        </ItemElem>        
      </Items_A>
      <Items_B>
        <ItemElem>
          <Item_Values>
            <ProdA>8</ProdA>
            <ProdB>9</ProdB>
          </Item_Values>
        </ItemElem>
      </Items_B>
    </ItemList>
  </SDocument>
</Document-St-5>
EOT

data = doc.search('SDocument').map{ |node| 
  %w[ProdA ProdB].map{ |n| node.search(n).map(&:text) }
}


data # => [[["2", "8"], ["8", "9"]]]

It results in a bit deeper nesting than you want but it's close.
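
The %w[ProdA ProdB] in that snippet is Ruby's shorthand for an array of strings: the words are split on whitespace, and there can be as many of them as needed. A small sketch in plain Ruby (the variable names are only for illustration):

```ruby
# %w[...] builds an array of strings from whitespace-separated words.
names = %w[ProdA ProdB]
names == ['ProdA', 'ProdB']   # => true

# It takes any number of words, not just two.
more = %w[StName StCode ProdA ProdB]
more.length                   # => 4

# %w does not interpolate; the capitalized %W form does.
suffix = 'A'
plain = %w[Items_#{suffix}]         # the text is kept literally
interpolated = %W[Items_#{suffix}]  # => ["Items_A"]
```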

A slightly different way, perhaps easier to follow, is:

data = doc.search('SDocument').map{ |node| 
  %w[A B].map{ |ab|
    node.at("Items_#{ ab }").search('ProdA, ProdB').map(&:text)
  }
}

The nesting is one level deeper than you specified because I'm assuming there will be multiple <SDocument> tags in the XML. If there won't be, the code can be modified a bit to return the array as you asked:

data = doc.search('Items_A, Items_B').map{ |node| 
  node.search('ProdA, ProdB').map(&:text)
}

data # => [["2", "8"], ["8", "9"]]

Notice I'm using CSS selectors, which make it easy to tell the code to look at two different nodes at once: Items_A and Items_B in the outer search, and ProdA and ProdB in the inner one.


Update after the question completely changed:

Here's the set-up:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<Document-St-5>
  <SDocument>
    <StName>WERLJ01</StName>
    <StCode>MEKLD</StCode>
  <ItemList>
    <Items_A>
      <ItemElem>
        <Item_Values>
          <ProdA>2</ProdA>
          <ProdB>8</ProdB>
        </Item_Values>
      </ItemElem>        
    </Items_A>
    <Items_A>
      <ItemElem>
        <Item_Values>
          <ProdA>9</ProdA>
          <ProdB>3</ProdB>
        </Item_Values>
      </ItemElem>        
    </Items_A>       
    <Items_B>
      <ItemElem>
        <Item_Values>
          <ProdA>1</ProdA>
          <ProdB>17</ProdB>
        </Item_Values>
      </ItemElem>
    </Items_B>
  </ItemList>
  </SDocument>
</Document-St-5>  
EOT

Here's the code:

data = %w[StName StCode].map{ |n| [doc.at(n).text] }
%w[ProdA ProdB].each do |prod|
  data << doc.search('Items_A').map{ |item| item.at(prod).text }
end
%w[ProdA ProdB].each do |prod|
  data << [doc.at("Items_B #{prod}").text]
end

Here's what was captured:

data # => [["WERLJ01"], ["MEKLD"], ["2", "9"], ["8", "3"], ["1"], ["17"]]
Licensed under: CC-BY-SA with attribution