How do I read individual XML nodes from a node that contains both CDATA and XML?

https://stackoverflow.com/questions/12790752

06-07-2021

Question

I have several XML files that randomly contain nodes with both CDATA and regular XML nodes inside. I need to read the contents of these nodes, but I am unsure how to determine whether a node is a normal XML node, a CDATA node, or a node that contains a mix of both, where the CDATA portions at the beginning and end could contain anything. (I'm using XPath to reference my nodes, if it helps.)

Line used to retrieve the textual contents of the node:

contentObj.text = contentNode.selectSingleNode("./text").text;

Example of the xml causing the problem:

<text>
     <![CDATA[<P align=center>&nbsp;</P>
          <P align=center>]]>
     <media identifier="005896523">
          <label>
               <![CDATA[NOTE]]>
          </label>
          <description>
               <![CDATA[Image for NOTE]]>
          </description>
          <comments>Update Required</comments>
     </media>
    <![CDATA[</P>
       <P>&nbsp;</P>
       <P align=left>&nbsp;</P>]]>
</text>

Solution

When you say

contentNode.selectSingleNode("./text")

this of course returns the <text> element node; but when you then ask for its .text property, you are asking for the text content of the whole <text> element, which is the concatenation of the values of all its descendant text nodes.

If you want to select a single text node, try

contentNode.selectSingleNode("./text/text()[1]").text;

That is, select the first text node child of the <text> element, then retrieve its text property. That should give you "<P align=center>&nbsp;</P> <P align=center>" (as unparsed text, not as an XML tree) in your example.

To distinguish between CDATA and non-CDATA content, you'll have to work around XPath, which is not designed to tell them apart. The XML DOM, on the other hand, can, at least in some implementations. So you can try

var children = contentNode.selectNodes("./text/node()");

which will select a nodeList of all the children of the <text> element, including text nodes, element nodes, and possibly CDATA nodes. Iterate through the nodes in children and check their nodeType property to see whether it's NODE_CDATA_SECTION, NODE_TEXT, or something else.
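
The iteration described above can be sketched roughly as follows. This is only an illustration, not tested against MSXML: classifyChildren and fakeNodeList are hypothetical names, and the fake list merely stands in for the result of selectNodes so the snippet runs without an XML parser. The numeric codes are the standard DOM node type values (1 for elements, 3 for text, 4 for CDATA sections), which MSXML exposes as NODE_ELEMENT, NODE_TEXT and NODE_CDATA_SECTION.

```javascript
// Standard DOM node type codes (same values in MSXML and the W3C DOM).
var NODE_ELEMENT = 1;
var NODE_TEXT = 3;
var NODE_CDATA_SECTION = 4;

// Hypothetical helper: labels each node in a DOM-style node list.
function classifyChildren(children) {
    var result = [];
    for (var i = 0; i < children.length; i++) {
        var node = children.item(i);
        if (node.nodeType === NODE_CDATA_SECTION) {
            result.push({ kind: "cdata", value: node.nodeValue });
        } else if (node.nodeType === NODE_TEXT) {
            result.push({ kind: "text", value: node.nodeValue });
        } else if (node.nodeType === NODE_ELEMENT) {
            result.push({ kind: "element", value: node.nodeName });
        } else {
            result.push({ kind: "other", value: null });
        }
    }
    return result;
}

// Stand-in for a real node list (assumption: only length and item(i) are used).
function fakeNodeList(nodes) {
    return { length: nodes.length, item: function (i) { return nodes[i]; } };
}

// Roughly the shape of the <text> element's children in the question.
var sample = fakeNodeList([
    { nodeType: 4, nodeName: "#cdata-section", nodeValue: "<P align=center>" },
    { nodeType: 1, nodeName: "media", nodeValue: null },
    { nodeType: 4, nodeName: "#cdata-section", nodeValue: "</P>" }
]);

var classified = classifyChildren(sample);
var kinds = [];
for (var k = 0; k < classified.length; k++) {
    kinds.push(classified[k].kind);
}
console.log(kinds.join(", ")); // cdata, element, cdata
```

With the real DOM you would presumably pass the children node list from the line above instead of the fake one.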

Let us know how it goes, and whether you need further help.

Edit

I assume from the fact that you accepted this answer that you were able to get things working, and I'm glad you were able to.

However, I don't want to let this go without emphasizing the caveat that @choroba was alluding to: a CDATA wrapper (around a chunk of text) is invisible to most XML tools (though the text content is visible). The XML data model (described informally here) doesn't know anything about CDATA sections. The standard for XML Infoset explicitly omits information about the boundaries of CDATA marked sections.

So, while you "got lucky" this time, in that you were using XML DOM which does provide information about CDATA sections, it is against the spirit of XML (and therefore unwise) to rely on that information to encode significant data in XML. For that reason, you would be well-served to encode that information some other way. Otherwise, if you ever need to use other XML tools on the data, you could get stuck.

I think the significant information you're trying to extract here is the fact that the text in the CDATA sections is escaped markup, i.e. pieces of HTML tags that are not supposed to be (or can't be) part of the XML tree. So you could encode that identification by surrounding each one with a custom element:

<text>
     <escaped><![CDATA[<P align=center>&nbsp;</P>
          <P align=center>]]></escaped>
     <media identifier="005896523">
     ...

Then in order to find these sections in the future, all you have to do is look for elements named <escaped>, which is a simple and natural task for any XML tool.
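
As a rough sketch of that lookup, assuming the <escaped> wrapper suggested above: collectEscapedText and stubList are hypothetical names, and the stub objects stand in for real DOM nodes so the snippet runs on its own. The text property is the MSXML one used in the question; other DOM implementations would use a different property.

```javascript
var NODE_ELEMENT = 1; // standard DOM node type code for elements

// Hypothetical helper: returns the text of every <escaped> child of an element.
function collectEscapedText(parent) {
    var found = [];
    var kids = parent.childNodes;
    for (var i = 0; i < kids.length; i++) {
        var node = kids.item(i);
        if (node.nodeType === NODE_ELEMENT && node.nodeName === "escaped") {
            found.push(node.text);
        }
    }
    return found;
}

// Stand-in for a real node list (assumption: only length and item(i) are used).
function stubList(nodes) {
    return { length: nodes.length, item: function (i) { return nodes[i]; } };
}

// Stub for the restructured <text> element: two <escaped> children around <media>.
var textElement = {
    childNodes: stubList([
        { nodeType: 1, nodeName: "escaped", text: "<P align=center>" },
        { nodeType: 1, nodeName: "media", text: "" },
        { nodeType: 1, nodeName: "escaped", text: "</P>" }
    ])
};

var escaped = collectEscapedText(textElement);
console.log(escaped.length); // 2
```

With XPath available, a query such as "./text/escaped" passed to selectNodes should achieve the same thing in one step.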

I don't know whether the design of these XML files is under your control or not. If not, you at least should have the option of sending feedback to the designer. If a designer who is not well-versed in XML things makes a design mistake, it's in their best interests to know about it, so that they might be able to correct it, or at least avoid the same mistake in future designs. If you're working under a chain of command, and the designer of the XML is in a different department, the appropriate route for feedback might be through your supervisor. It's in the department's best interest to know if they're producing non-portable XML designs.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow