What is the reason that CDATA even exists?

https://stackoverflow.com/questions/1714209

19-09-2019

Question

I often see people asking XML/XSLT-related questions here that are rooted in an inability to grasp how CDATA works (like this one).

I wonder: why does it exist in the first place? It's not as if XML could not do without it; everything you can put into a CDATA section can also be expressed "natively" (XML-escaped).

I appreciate that CDATA potentially makes the resulting document a bit smaller, but let's face it - XML is verbose anyway. Small XML documents can be achieved more easily through compression, for example.

For me, CDATA breaks the clean separation of markup and data since you can have data that looks like markup to the unaided eye, which I find is a bad thing. (This may even be one of the things that encourages people to inadequately apply string processing or regex to XML.)

So: What good reason is there to use CDATA?

Solution

CDATA sections are purely a convenience for human authors, not for programs. Their only use is to let humans easily include, for example, SVG sample code in an XHTML page without having to carefully replace every < with &lt; and so on.

That, for me, is the intended use, not making the resulting document a few bytes smaller because you can write < instead of &lt;.

Also, taking the sample from above again (SVG code in XHTML), it makes it easy for me to view the source of the XHTML file and copy-paste the SVG code out without having to back-replace &lt; with <.
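To illustrate that the two spellings carry the same data, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element name `example` and the SVG snippet are made up for illustration; the point is only that a parser should hand back the same text either way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same SVG snippet, written two ways.
svg = '<svg width="10" height="10"><circle r="5"/></svg>'

# Hand-escaped version: every < and > replaced by an entity reference.
escaped_doc = (
    "<example>&lt;svg width=\"10\" height=\"10\"&gt;"
    "&lt;circle r=\"5\"/&gt;&lt;/svg&gt;</example>"
)

# CDATA version: the snippet is pasted in unchanged.
cdata_doc = "<example><![CDATA[" + svg + "]]></example>"

escaped_text = ET.fromstring(escaped_doc).text
cdata_text = ET.fromstring(cdata_doc).text

# Both documents yield the same character data.
print(escaped_text == cdata_text == svg)
```

The difference is only in what the human author has to type and read.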

OTHER TIPS

PCDATA - parsed character data, which means the parser will scan the data for markup.

CDATA - the text inside a CDATA section is not parsed as markup; the parser passes it through as plain character data. As a result, a malicious user can send harmful data to the application using CDATA sections.

CDATA section starts with <![CDATA[ and ends with ]]>.

The only string that cannot occur in CDATA is ]]>.
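If the text does contain ]]>, the usual workaround is to split that sequence across two adjacent CDATA sections. A minimal sketch, assuming Python's xml.etree.ElementTree for the round trip; the helper name `to_cdata` and the sample string are hypothetical:

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, splitting any ']]>' across two sections."""
    # ']]>' becomes ']]' (end of one section) + '>' (start of the next).
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Made-up sample that happens to contain the forbidden sequence.
tricky = "if (a[b[0]]> 1) { ok(); }"

doc = "<code>" + to_cdata(tricky) + "</code>"

# The parser joins the adjacent sections back into one string.
round_tripped = ET.fromstring(doc).text
print(round_tripped == tricky)
```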

The only reason why we use CDATA is that text such as JavaScript code contains a lot of < and & characters. To avoid errors, script code can be placed in a CDATA section, because a bare < generates an error when the parser interprets it as the start of a new element. Similarly, & is interpreted by the parser as the start of an entity or character reference.
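A small sketch of that failure mode, assuming Python's xml.etree.ElementTree as the parser; the script text is invented for the example:

```python
import xml.etree.ElementTree as ET

# Made-up script body containing bare < and & characters.
script = "if (a < b && b < c) { go(); }"

raw_doc = "<script>" + script + "</script>"
cdata_doc = "<script><![CDATA[" + script + "]]></script>"

# Pasted in raw, the script makes the document not well-formed.
try:
    ET.fromstring(raw_doc)
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# Inside a CDATA section the same characters are plain text.
cdata_text = ET.fromstring(cdata_doc).text

print(raw_parses, cdata_text == script)
```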

I believe that CDATA was intended to allow raw binary data: as long as it doesn't contain "]]>" then anything goes in a CDATA section. This does set it apart from normal XML and should speed up parsing (and negate the necessity for full text encoding, thus giving a second performance boost). Actually it proved quite problematic what with people not escaping the closing sequence and several early parsers being variously broken, so most now just use a text encoding for binary data, making the CDATA section somewhat pointless, yes.

EDIT: note that this answer is in fact wrong, as Tomalak identifies in comments. I've not deleted it because I know there are other people out there who think that raw binary is acceptable in CDATA and this might clear up that little misunderstanding.
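To make the misunderstanding concrete, here is a hedged sketch, again assuming Python's xml.etree.ElementTree: a control character is rejected even inside a CDATA section, whereas a text encoding such as Base64 survives the round trip. The byte values and the element name `data` are arbitrary choices for the example.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary made-up bytes standing in for "raw binary".
blob = bytes([0x01, 0x02, 0xFF, 0x00])

# A control character is not a legal XML character, CDATA or not.
bad_doc = "<data><![CDATA[\x01]]></data>"
try:
    ET.fromstring(bad_doc)
    cdata_accepts_control_char = True
except ET.ParseError:
    cdata_accepts_control_char = False

# Text-encoding the bytes first gives ordinary character data.
elem = ET.Element("data")
elem.text = base64.b64encode(blob).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")

decoded = base64.b64decode(ET.fromstring(xml_text).text)
print(cdata_accepts_control_char, decoded == blob)
```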

I don't know how helpful this will be, but I'll throw this in too:

One of the issues is that there are a couple of distinct camps of XML developers, where some view XML as a representation of data, and some view it in a more document-centric way. (The beauty of XML is that it works well for both.)

Those who view XML as a representation of data--where the XML is often being produced and consumed by tools, and humans only get involved for troubleshooting--will see little value in a CDATA section, because it doesn't make a difference to their tools, whereas those who use XML for more document-centric purposes may find CDATA sections much more useful.

To me, CDATA is just another word for lazy. When I started out with XML I used it, but nowadays I always convert data.

The best reason I can come up with is convenience, especially when you are using XML as some form of wrapper to transport data from one system to another. In that case you may end up with the following steps:

Create XML wrapper
Convert data to XML
Put data inside wrapper
Send XML to receiver
Split XML to XML + Data in XML
Convert Data in XML to Data

Using CDATA, by contrast, would make the extra conversion steps unnecessary.
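The two routes can be sketched with Python's xml.dom.minidom, which can write either an escaped text node or a CDATA section. The element name `envelope`, the payload, and the helpers `wrap` and `unwrap` are hypothetical; note that with a DOM library the escaping is done by the library either way, so the saving mostly applies when the wrapper is assembled by hand.

```python
from xml.dom import minidom

# Made-up payload containing characters that need escaping in XML.
payload = 'if (a < b && b > c) { run("x"); }'


def wrap(data, use_cdata):
    """Put data inside an <envelope> element, escaped or as CDATA."""
    doc = minidom.Document()
    root = doc.createElement("envelope")
    doc.appendChild(root)
    if use_cdata:
        node = doc.createCDATASection(data)  # written verbatim
    else:
        node = doc.createTextNode(data)      # escaped on output
    root.appendChild(node)
    return root.toxml()


def unwrap(xml_text):
    """Return the character data inside the <envelope> element."""
    root = minidom.parseString(xml_text).documentElement
    return "".join(child.data for child in root.childNodes)


escaped_xml = wrap(payload, use_cdata=False)
cdata_xml = wrap(payload, use_cdata=True)

# The receiver gets the same payload back from either form.
print(unwrap(escaped_xml) == unwrap(cdata_xml) == payload)
```

This sketch does not handle a payload that itself contains ]]>, which cannot be written as a single CDATA section.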

Another usage could be to embed data without having to care about the different namespaces in the embedded data. But that is not really a great way to use it.

I've found another example of a good way to use CDATA, one that I should have thought of: when you need to embed code in an XML file and the code must not be converted, because otherwise it will not work and/or will not be easily readable.

MXML demonstrates a great use of CDATA tags. One of the things I like about MXML is that it is valid XML, meaning I can do useful things like generate Flash widgets programmatically from a different XML file using a transform, and validate MXML against a schema.

CDATA tags are useful in MXML because they let me define an ActionScript script block within an MXML file, allowing me to combine an ECMA-type scripting language (with > and < and the like) and valid XML in a single file.

EDIT:

I suppose another option for combining MXML and ActionScript would be to do it the way HTML and JavaScript are combined, that is, to wrap the script in an XML comment inside the script block; the choice to use CDATA instead was made by the developers of the MXML compiler. I suppose the reasoning has more to do with editing: the MXML editor validates your code against a schema to check syntax and provide context help, as well as parsing your ActionScript code for syntax and context help. Using CDATA allows the editor to do both and to differentiate between XML comments and script blocks.

When in doubt, check the spec:

2.7 CDATA Sections

Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup.

CDATA sections are really useful when you want to define a schema for some XML but part of it is out of your control and you can't ensure that it will meet the schema or won't break the XML.

I often have to work with legacy systems whose HTML output is often not well-formed XHTML. I can attach a schema that ensures the XML is structured correctly, but have one tag that just contains a CDATA section for housing the potentially bad HTML.

It's not a common usage, but it definitely has its uses when you don't want other people's lax programming to break your system.
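A rough sketch of that arrangement, assuming Python's xml.etree.ElementTree and invented element names (`report`, `title`, `body`); schema validation itself is not shown, only that the outer structure stays intact around the bad HTML:

```python
import xml.etree.ElementTree as ET

# Made-up legacy output: unclosed tags and a bare ampersand.
legacy_html = "<p>Price: 5 & up<br><b>unclosed"

doc = (
    "<report><title>Export</title>"
    "<body><![CDATA[" + legacy_html + "]]></body></report>"
)

root = ET.fromstring(doc)
body = root.find("body")

# The outer structure is exactly what was declared.
child_tags = [child.tag for child in root]

# The HTML arrives as opaque text, not as child elements of <body>.
print(child_tags, len(body), body.text == legacy_html)
```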

Licensed under: CC-BY-SA with attribution
