How to get multiple text nodes from XML using XPath, with knowledge of how they were broken apart

https://stackoverflow.com/questions/18135693

24-06-2022

Question

I have some horrible xml in the following format (anonymized to protect the guilty):

<root>
  <outer attribute="myValue">
    <middle>
      <inner>
        arbitrary text<break />more arbitrary text<break />
      </inner>
    </middle>
  </outer>
  ...
  <outer attribute="myValue">
    <middle>
      <inner>
        arbitrary text<break />more arbitrary text
      </inner>
    </middle>
  </outer>
</root>

The self-closing <break /> nodes represent paragraph breaks, while the move into a completely separate outer/middle/inner tree holds no significance at all (and must not result in a paragraph break).

The straightforward XPath expression /*/outer/middle/inner/text() gets me all the text nodes, but I can no longer tell which text-node boundaries are paragraph breaks and which are not. (The actual expression is nowhere near that simple because of namespace abuse and other cruft, but that's the gist of it.)

What would be the best approach here to circumvent this shortcoming and correctly ignore the non-paragraph breaks between text? Is there a way I can capture the break nodes as well and identify them among the text nodes in an order-preserved list?

For additional context, I'm working in InterSystems Caché using the %XML.XPATH.Document API (which wraps standard SAX but may still limit how sophisticated the approach can be).

Some references:

http://docs.intersystems.com/cache20131/csp/documatic/%25CSP.Documatic.cls?PAGE=CLASS&LIBRARY=%25SYS&CLASSNAME=%XML.XPATH.Document

http://docs.intersystems.com/cache20131/csp/documatic/%25CSP.Documatic.cls?PAGE=CLASS&LIBRARY=%25SYS&CLASSNAME=%25XML.XPATH.ResultHandler

Solution

You probably just want to select the inner elements themselves, rather than their text nodes, with //outer/middle/inner. The values in the

%ListOfObjects(CLASSNAME="%XML.XPATH.RESULT")

will be of type %XML.XPATH.DOMResult rather than %XML.XPATH.ValueResult as you have been getting. The %XML.XPATH.DOMResult values will represent a subtree of the DOM that contains both the arbitrary text nodes and the "break" nodes.
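
To make the idea concrete outside of Caché, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree, not the %XML.XPATH.Document API. It selects the inner elements themselves and walks each subtree in document order, ending a paragraph only at a break element and never at the boundary between two inner elements. The function name paragraphs, the SAMPLE document, and the choice to join text fragments with a single space are assumptions made for illustration; namespaces are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second paragraph spans two <inner> elements.
SAMPLE = """<root>
  <outer><middle><inner>one<break />two</inner></middle></outer>
  <outer><middle><inner>still two<break />three</inner></middle></outer>
</root>"""


def paragraphs(xml_text):
    """Rebuild paragraphs: only <break /> ends one, not an <inner> boundary."""
    root = ET.fromstring(xml_text)
    result, current = [], []

    def add_text(text):
        # Keep non-blank text only; indentation whitespace is dropped.
        if text and text.strip():
            current.append(text.strip())

    def end_paragraph():
        # Flush the fragments collected so far as one paragraph.
        if current:
            result.append(" ".join(current))
            current.clear()

    # Select the <inner> elements themselves, in document order.
    for inner in root.findall("./outer/middle/inner"):
        add_text(inner.text)        # text before the first child element
        for child in inner:
            if child.tag == "break":
                end_paragraph()
            add_text(child.tail)    # text that follows this child element
    end_paragraph()
    return result


if __name__ == "__main__":
    print(paragraphs(SAMPLE))
```

The same walk should carry over to a DOM-style result in any API that exposes element and text nodes in order: keep a running buffer, flush it when a break element is seen, and do nothing special when one inner subtree ends and the next begins.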

The %XML.XPATH.Document class has an Example2 method that illustrates this to some extent. You might want to experiment with a subclass that overrides the "ExampleXML" XData block with some more intermediate nodes and copies Example2 with an XPath expression that returns a whole subtree. That should make clear how to approach your actual, more complicated problem.

Licensed under: CC-BY-SA with attribution
