I'm pretty sure that's correct behavior (though I've always found the XmlSlurper and XmlParser APIs screwy). Everything you can iterate through really should implement a node interface, IMO, and potentially have a type of TEXT that you could use to know to get the text from it.
Those text nodes are valid nodes that in many cases you'd want to hit during a depth-first traversal of the XML. If they didn't get returned, your algorithm of checking for a children size of 1 wouldn't work, because some nodes (like the <p> tag) have both text and elements mixed underneath them.
Also, the fact that depthFirst doesn't consistently return text nodes (it skips them when the text is an element's only child, as with italic above) makes things even worse.
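To see the mix of types for yourself, here's a minimal sketch. The one-line <p> snippet is a made-up, trimmed-down version of the XML in the question, and the 'TEXT' label is just something I picked for illustration:

```groovy
// Groovy 3+: XmlParser lives in groovy.xml. On Groovy 2.x, drop this import
// (groovy.util.XmlParser is imported by default there).
import groovy.xml.XmlParser

// Hypothetical, trimmed-down version of the <p> element from the question.
def p = new XmlParser().parseText('<p>This contains <italic>italics</italic> and more</p>')

// Label each item that '**' (depthFirst) yields:
// the element name for Nodes, 'TEXT' for plain Strings.
def kinds = p.'**'.collect { it instanceof Node ? it.name() : 'TEXT' }

println kinds
```

On the versions I've used this should print [p, TEXT, italic, TEXT]: the two loose strings inside <p> show up as entries of their own, while the text inside <italic> does not.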
I tend to like to use the signatures of Groovy methods to let the runtime figure out the right way to handle each node (rather than using something like instanceof), like this:
def rawXml = """<xml>
<metadata>
<article>
<body>
<sec>
<title>A Title</title>
<p>
This contains
<italic>italics</italic>
and
<xref ref-type="bibr">xref's</xref>
.
</p>
</sec>
<sec>
<title>Second Title</title>
</sec>
</body>
</article>
</metadata>
</xml>"""
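// Text nodes come back from depthFirst as plain Strings: return them as-is.
def processNode(String nodeText) {
    return nodeText
}

// Element nodes: return the text only when the element has exactly one child.
// Anything else falls through and returns null, which findResults drops.
def processNode(Object node) {
    if (node.children().size() == 1) {
        return node.text()
    }
}

// Groovy 3+: needs `import groovy.xml.XmlParser` at the top of the script.
def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)

// '**' is shorthand for depthFirst()
def xmlText = xml.metadata.article.body[0].'**'.findResults { node ->
    processNode(node)
}

println xmlText.join(" ")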
Prints:
A Title This contains italics and xref's . Second Title
Alternatively, the XmlSlurper class probably does more of what you want/expect, and its text() method returns more reasonable output. If you really don't need to do any sort of DOM walking with the results (what XmlParser is "better" for), I'd suggest XmlSlurper:
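// rawXml is the same string as in the XmlParser example
def xmlSlurper = new XmlSlurper()
def xml = xmlSlurper.parseText(rawXml)
// text() here concatenates all the text found beneath the node
def bodyText = xml.metadata.article.body[0].text()
println bodyText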
Prints:
A Title
This contains
italics
and
xref's
.
Second Title