Question

I have an XML file with a flat structure. We do not control the format of this file; we just have to deal with it. I've renamed the fields because they are highly domain-specific and make no difference to the problem.

<attribute name="Title">Book A</attribute>
<attribute name="Code">1</attribute>
<attribute name="Author">
   <value>James Berry</value>
   <value>John Smith</value>
</attribute>
<attribute name="Title">Book B</attribute>
<attribute name="Code">2</attribute>
<attribute name="Title">Book C</attribute>
<attribute name="Code">3</attribute>
<attribute name="Author">
    <value>James Berry</value>
</attribute>

Key things to note: the file is not particularly hierarchical. Books are delimited by an occurrence of an attribute element with name='Title'. But the name='Author' attribute node is optional.

Is there a simple XPath expression I can use to find the authors of book 'n'? It is easy to identify the title of book 'n', but the Author attribute is optional. And you can't just take the next Author that follows, because for book 2 this would give the author of book 3.

I have written a state machine to parse this as a series of elements, but I can't help thinking there would have been a way to directly get the results that I want.


Solution

We want the "attribute" element with @name 'Author' that follows an "attribute" element with @name 'Title' and a value of 'Book n', without any other "attribute" element with @name 'Title' between them (if there is one, the author belongs to some other book).

Put differently, we want an author whose first preceding title (the one it "belongs to") is the one we're looking for:

//attribute[@name='Author']
[preceding-sibling::attribute[@name='Title'][1][contains(.,'Book N')]]

N=C => finds <attribute name="Author"><value>James Berry</value></attribute>

N=B => finds nothing
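
For readers who want to check the same logic outside an XPath engine, here is a minimal Python sketch using only the standard library. It assumes the flat elements are wrapped in a hypothetical `<books>` root (the original file has none), and `authors_of` is an illustrative helper name. ElementTree's limited XPath support has no preceding-sibling axis, so the backwards walk is done by hand.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in an assumed <books> root.
SAMPLE = """<books>
<attribute name="Title">Book A</attribute>
<attribute name="Code">1</attribute>
<attribute name="Author"><value>James Berry</value><value>John Smith</value></attribute>
<attribute name="Title">Book B</attribute>
<attribute name="Code">2</attribute>
<attribute name="Title">Book C</attribute>
<attribute name="Code">3</attribute>
<attribute name="Author"><value>James Berry</value></attribute>
</books>"""


def authors_of(root, title_fragment):
    """Return author names whose nearest preceding Title contains title_fragment."""
    siblings = list(root)
    found = []
    for i, node in enumerate(siblings):
        if node.get("name") != "Author":
            continue
        # Walk backwards to the first preceding Title,
        # mirroring preceding-sibling::attribute[@name='Title'][1].
        for prev in reversed(siblings[:i]):
            if prev.get("name") == "Title":
                if title_fragment in (prev.text or ""):
                    found.extend(v.text for v in node.findall("value"))
                break  # only the nearest title counts
    return found


root = ET.fromstring(SAMPLE)
print(authors_of(root, "Book C"))  # ['James Berry']
print(authors_of(root, "Book B"))  # []
```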

Using keys (available since XSLT 1.0) or the grouping instructions added in XSLT 2.0 would make this easier (and much faster if the file is big).
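
The idea behind a key is to build the title-to-authors lookup once instead of re-scanning the siblings for every query. A rough Python analogue, offered as a sketch only: it assumes a hypothetical `<books>` wrapper root and unique titles, and `index_authors_by_title` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in an assumed <books> root.
SAMPLE = """<books>
<attribute name="Title">Book A</attribute>
<attribute name="Code">1</attribute>
<attribute name="Author"><value>James Berry</value><value>John Smith</value></attribute>
<attribute name="Title">Book B</attribute>
<attribute name="Code">2</attribute>
<attribute name="Title">Book C</attribute>
<attribute name="Code">3</attribute>
<attribute name="Author"><value>James Berry</value></attribute>
</books>"""


def index_authors_by_title(root):
    """Build {title: [authors]} in a single pass over the flat siblings."""
    index = {}
    current = None  # title of the book we are currently inside
    for node in root:
        name = node.get("name")
        if name == "Title":
            current = (node.text or "").strip()
            index[current] = []  # a new book starts here, possibly with no authors
        elif name == "Author" and current is not None:
            index[current].extend(v.text for v in node.findall("value"))
    return index


index = index_authors_by_title(ET.fromstring(SAMPLE))
print(index["Book B"])  # []
print(index["Book C"])  # ['James Berry']
```

After the single pass, each lookup is a plain dictionary access, which is where the speed-up on large files would come from.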

(SO code parser seems to think '//' stands for 'comments' but in XPath it's not!!! Sigh.)

OTHER TIPS

I used ElementTree to extract the data from the XML above. I saved the XML in a file named foo.xml (wrapped in a single root element so that it parses).

import xml.etree.ElementTree as ET

def extract_data(path='foo.xml'):
    """Return a dict mapping each book title to its list of authors.

    Books without an Author attribute are left out. The file needs a
    single root element wrapped around the <attribute> elements.
    """
    root = ET.parse(path).getroot()
    books = {}
    title = None

    for attribute in root.findall('attribute'):
        name = attribute.get('name')
        if name == 'Title':
            # A Title starts a new book.
            title = attribute.text
        elif name == 'Author' and title is not None:
            # An Author belongs to the most recent Title.
            books[title] = [v.text for v in attribute.findall('value')]
    return books

When you run this function you will get this:

{'Book A': ['James Berry', 'John Smith'], 'Book C': ['James Berry']}

I hope this is what you are looking for. If not then just specify a bit more. :)

As bambax noted in his answer, a solution using XSLT keys is more efficient, especially for large XML documents:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>
 <!--                                            -->
 <xsl:key name="kAuthByTitle" 
  match="attribute[@name='Author']"
  use="preceding-sibling::attribute[@name='Title'][1]"/>
 <!--                                            -->
    <xsl:template match="/">
      Book C Author:
      <xsl:copy-of select=
         "key('kAuthByTitle', 'Book C')"/>
  <!--                                            -->
         ====================
      Book B Author:
      <xsl:copy-of select=
         "key('kAuthByTitle', 'Book B')"/>
    </xsl:template>
</xsl:stylesheet>

When the above transformation is applied on this XML document:

<t>
    <attribute name="Title">Book A</attribute>
    <attribute name="Code">1</attribute>
    <attribute name="Author">
        <value>James Berry</value>
        <value>John Smith</value>
    </attribute>
    <attribute name="Title">Book B</attribute>
    <attribute name="Code">2</attribute>
    <attribute name="Title">Book C</attribute>
    <attribute name="Code">3</attribute>
    <attribute name="Author">
        <value>James Berry</value>
    </attribute>
</t>

the correct output is produced:

  Book C Author:
  <attribute name="Author">
    <value>James Berry</value>
</attribute>

     ====================
  Book B Author:

Do note that using the "//" XPath abbreviation should be avoided as much as possible, as it usually causes the whole XML document to be scanned on each evaluation of the XPath expression.
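
The same distinction exists in Python's ElementTree, which makes for a small illustration (a sketch on a tiny made-up document, not a benchmark): a child step looks only at direct children, while the `.//` form may have to consider every element in the tree.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<t>"
    "<attribute name='Title'>Book A</attribute>"
    "<attribute name='Author'><value>James Berry</value></attribute>"
    "</t>"
)

# Child step: inspects only the direct children of <t>.
direct = doc.findall("attribute[@name='Author']")

# Descendant step (the ".//" form): searches at every depth below <t>.
anywhere = doc.findall(".//attribute[@name='Author']")

# Both find the same single element here, but the candidate sets differ in size.
children = len(list(doc))            # elements a child step considers
visited = sum(1 for _ in doc.iter()) # elements in the whole tree, root included

print(len(direct), len(anywhere))  # 1 1
print(children, visited)           # 2 4
```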

Select all titles and apply a template to each:

<xsl:template match="/">
  <xsl:apply-templates select="//attribute[@name='Title']"/>
</xsl:template>

In the template, output the title, then check whether a next title exists. If not, output the following author. If it does exist, check whether the following author node of the following book is the same as the following author node of the current book. If it is, the current book has no author:

<xsl:template match="attribute[@name='Title']">
  <book>
    <title><xsl:value-of select="."/></title>
    <author>
      <!-- Compare node identities, not string values: two books may share an author name. -->
      <xsl:if test="not(following::attribute[@name='Title']) or generate-id(following::attribute[@name='Author'][1]) != generate-id(following::attribute[@name='Title'][1]/following::attribute[@name='Author'][1])">
        <xsl:value-of select="following::attribute[@name='Author'][1]"/>
      </xsl:if>
    </author>
  </book>
</xsl:template>
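
For comparison, the same restructuring can be sketched in Python with the standard library. This is a variation rather than a port of the template: it assumes a hypothetical `<books>` wrapper root on the input, emits one `<author>` element per `<value>`, and `to_books` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in an assumed <books> root.
SAMPLE = """<books>
<attribute name="Title">Book A</attribute>
<attribute name="Code">1</attribute>
<attribute name="Author"><value>James Berry</value><value>John Smith</value></attribute>
<attribute name="Title">Book B</attribute>
<attribute name="Code">2</attribute>
<attribute name="Title">Book C</attribute>
<attribute name="Code">3</attribute>
<attribute name="Author"><value>James Berry</value></attribute>
</books>"""


def to_books(root):
    """Rebuild the flat siblings as nested <book> elements."""
    books = ET.Element("books")
    book = None  # the <book> currently being filled
    for node in root:
        name = node.get("name")
        if name == "Title":
            # Every Title opens a new <book>.
            book = ET.SubElement(books, "book")
            ET.SubElement(book, "title").text = (node.text or "").strip()
        elif name == "Author" and book is not None:
            # Authors attach to the most recently opened <book>.
            for value in node.findall("value"):
                ET.SubElement(book, "author").text = value.text
    return books


books = to_books(ET.fromstring(SAMPLE))
print(ET.tostring(books, encoding="unicode"))
```

A book without authors simply ends up with a `<title>` and no `<author>` children.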

I am not sure you really want to go there: the simplest I found was to go from the author, get the previous title, then check that the first author or title following that title was indeed an author. Ugly!

/books/attribute[@name="Author"]
  [preceding-sibling::attribute[@name="Title" and string()="Book B"]
                               [following-sibling::attribute[ @name="Author" 
                                                             or @name="Title"
                                                            ]
                                 [1]
                                 [@name="Author"]
                               ]
  ][1]

(I added the books tag to wrap around the file).

I tested that with libxml2, by the way, using xml_grep2, but only on the sample data you gave, so more tests are welcome.
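
The check inside that expression (the first Author-or-Title sibling after the title must be an Author) reads more naturally as a forward scan. A minimal Python sketch, assuming the same `<books>` wrapper and using an illustrative helper name, `authors_after_title`:

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in the <books> root.
SAMPLE = """<books>
<attribute name="Title">Book A</attribute>
<attribute name="Code">1</attribute>
<attribute name="Author"><value>James Berry</value><value>John Smith</value></attribute>
<attribute name="Title">Book B</attribute>
<attribute name="Code">2</attribute>
<attribute name="Title">Book C</attribute>
<attribute name="Code">3</attribute>
<attribute name="Author"><value>James Berry</value></attribute>
</books>"""


def authors_after_title(root, title):
    """Return the authors of the book with exactly this title, or [] if none."""
    siblings = list(root)
    for i, node in enumerate(siblings):
        if node.get("name") == "Title" and (node.text or "").strip() == title:
            # Scan forward to the first sibling that is an Author or a Title.
            for nxt in siblings[i + 1:]:
                if nxt.get("name") == "Author":
                    return [v.text for v in nxt.findall("value")]
                if nxt.get("name") == "Title":
                    break  # the next book starts before any Author
            return []
    return []  # title not present at all


root = ET.fromstring(SAMPLE)
print(authors_after_title(root, "Book B"))  # []
print(authors_after_title(root, "Book A"))  # ['James Berry', 'John Smith']
```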

Licensed under: CC-BY-SA with attribution