Is there a way to escape a CDATA end token in xml?

https://stackoverflow.com/questions/223652

03-07-2019
|

Question

I was wondering if there is any way to escape a CDATA end token (]]>) within a CDATA section in an xml document. Or, more generally, if there is some escape sequence for using within a CDATA (but if it exists, I guess it'd probably only make sense to escape begin or end tokens, anyway).

Basically, can you have a begin or end token embedded in a CDATA and tell the parser not to interpret it but to treat it as just another character sequence.

Probably, you should just refactor your xml structure or your code if you find yourself trying to do that, but even though I've been working with xml on a daily basis for the last 3 years or so and I have never had this problem, I was wondering if it was possible. Just out of curiosity.

Edit:

Other than using html encoding...

Solution

Clearly, this question is purely academic. Fortunately, it has a very definite answer.

You cannot escape a CDATA end sequence. Production rule 20 of the XML specification is quite clear:

[20]    CData      ::=      (Char* - (Char* ']]>' Char*))

EDIT: This product rule literally means "A CData section may contain anything you want BUT the sequence ']]>'. No exception.".

EDIT2: The same section also reads:

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "<" and "&". CDATA sections cannot nest.

In other words, it's not possible to use entity reference, markup or any other form of interpreted syntax. The only parsed text inside a CDATA section is ]]>, and it terminates the section.

Hence, it is not possible to escape ]]> within a CDATA section.

EDIT3: The same section also reads:

2.7 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]

Then there may be a CDATA section anywhere character data may occur, including multiple adjacent CDATA sections inplace of a single CDATA section. That allows it to be possible to split the ]]> token and put the two parts of it in adjacent CDATA sections.

ex:

<![CDATA[Certain tokens like ]]> can be difficult and <invalid>]]>

should be written as

<![CDATA[Certain tokens like ]]]]><![CDATA[> can be difficult and <valid>]]>

OTHER TIPS

You have to break your data into pieces to conceal the ]]>.

Here's the whole thing:

<![CDATA[]]]]><![CDATA[>]]>

The first <![CDATA[]]]]> has the ]]. The second <![CDATA[>]]> has the >.

You do not escape the ]]> but you escape the > after ]] by inserting ]]><![CDATA[ before the >, think of this just like a \ in C/Java/PHP/Perl string but only needed before a > and after a ]].

BTW,

S.Lott's answer is the same as this, just worded differently.

S. Lott's answer is right: you don't encode the end tag, you break it across multiple CDATA sections.

How to run across this problem in the real world: using an XML editor to create an XML document that will be fed into a content-management system, try to write an article about CDATA sections. Your ordinary trick of embedding code samples in a CDATA section will fail you here. You can imagine how I learned this.

But under most circumstances, you won't encounter this, and here's why: if you want to store (say) the text of an XML document as the content of an XML element, you'll probably use a DOM method, e.g.:

XmlElement elm = doc.CreateElement("foo");
elm.InnerText = "<[CDATA[[Is this a problem?]]>";

And the DOM quite reasonably escapes the < and the >, which means that you haven't inadvertently embedded a CDATA section in your document.

Oh, and this is interesting:

XmlDocument doc = new XmlDocument();

XmlElement elm = doc.CreateElement("doc");
doc.AppendChild(elm);

string data = "<![[CDATA[This is an embedded CDATA section]]>";
XmlCDataSection cdata = doc.CreateCDataSection(data);
elm.AppendChild(cdata);

This is probably an ideosyncrasy of the .NET DOM, but that doesn't throw an exception. The exception gets thrown here:

Console.Write(doc.OuterXml);

I'd guess that what's happening under the hood is that the XmlDocument is using an XmlWriter produce its output, and the XmlWriter checks for well-formedness as it writes.

simply replace ]]> with ]]]]><![CDATA[>

Here's another case in which ]]> needs to be escaped. Suppose we need to save a perfectly valid HTML document inside a CDATA block of an XML document and the HTML source happens to have it's own CDATA block. For example:

<htmlSource><![CDATA[ 
    ... html ...
    <script type="text/javascript">
        /* <![CDATA[ */
        -- some working javascript --
        /* ]]> */
    </script>
    ... html ...
]]></htmlSource>

the commented CDATA suffix needs to be changed to:

        /* ]]]]><![CDATA[> *//

since an XML parser isn't going to know how to handle javascript comment blocks

In PHP: '<![CDATA['.implode(explode(']]>', $string), ']]]]><![CDATA[>').']]>'

A cleaner way in PHP:

   function safeCData($string)
   {
      return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $string) . ']]>';
   }

Don't forget to use a multibyte-safe str_replace if required (non latin1 $string):

   function mb_str_replace($search, $replace, $subject, &$count = 0)
   {
      if (!is_array($subject))
      {
         $searches = is_array($search) ? array_values($search) : array ($search);
         $replacements = is_array($replace) ? array_values($replace) : array ($replace);
         $replacements = array_pad($replacements, count($searches), '');
         foreach ($searches as $key => $search)
         {
            $parts = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
         }
      }
      else
      {
         foreach ($subject as $key => $value)
         {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
         }
      }
      return $subject;
   }

Another solution is to replace ]]> by ]]]><![CDATA[]>.

See this structure:

<![CDATA[
   <![CDATA[
      <div>Hello World</div>
   ]]]]><![CDATA[>
]]>

For the inner CDATA tag(s) you must close with ]]]]><![CDATA[> instead of ]]>. Simple as that.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow