What actually are PCDATA and CDATA?

https://stackoverflow.com/questions/857876

21-08-2019

Question

It seems that a loose definition of PCDATA and CDATA is that

PCDATA is character data, but is to be parsed.
CDATA is character data, and is not to be parsed.

But then someone told me that CDATA is actually parsed, or that PCDATA is actually not parsed... so it is a bit confusing. Does anyone know what the real deal is?

Update: I actually added the PCDATA definition on Wikipedia... so don't take that answer too seriously as that's only my rough understanding of it.

Solution

From Wikipedia:

PCDATA

Simply speaking, PCDATA stands for Parsed Character Data. That means the characters are to be parsed by the XML, XHTML, or HTML parser (&lt; will be changed to <, <p> will be taken to mean a paragraph tag, etc.). Compare that with CDATA, where the characters are not to be parsed by the XML, XHTML, or HTML parser.

CDATA

The term CDATA, meaning character data, is used for distinct, but related purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

OTHER TIPS

Both PCDATA and CDATA are parsed. They are both character data.

They both must only include valid characters. For example, if your document encoding is UTF-8, the content of CDATA sections must still be valid UTF-8 characters. So random binary data will probably prevent the document from being well-formed. Also, CDATA sections are still parsed, if only to find the ]]> delimiter that ends the section. But other markup-like characters, like <, > and &, are ignored and passed as-is by the parser.

On the other hand, in PCDATA a literal < and & (and ' or " in attribute values) must be escaped, or they will be interpreted as markup. Entity references will also be expanded.

So yes, CDATA sections are indeed parsed. I am not sure why you were told that PCDATA is not parsed though.
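To make the points above concrete, here is a small sketch using Python's standard xml.etree.ElementTree module (my choice of parser, not something from the question). The helper name is_well_formed and the sample documents are made up for illustration. It checks that escaped PCDATA and a CDATA section yield the same text, that an unescaped < or & in PCDATA is rejected, and that a character XML 1.0 does not allow is rejected even inside CDATA.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Same text, written once as escaped PCDATA and once as a CDATA section.
escaped = "<note>1 &lt; 2 &amp; 3</note>"
in_cdata = "<note><![CDATA[1 < 2 & 3]]></note>"

# Literal < and & in PCDATA are read as (broken) markup.
unescaped = "<note>1 < 2 & 3</note>"

# U+0001 is not a legal XML 1.0 character, CDATA or not.
control_char = "<note><![CDATA[\x01]]></note>"

same_text = ET.fromstring(escaped).text == ET.fromstring(in_cdata).text

print(is_well_formed(escaped), is_well_formed(in_cdata))
print(is_well_formed(unescaped), is_well_formed(control_char))
print(same_text)
```

So the CDATA section is scanned by the parser like everything else; it only switches off the recognition of markup inside it.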

PCDATA - Parsed Character Data

CDATA - (Unparsed) Character Data

http://www.w3schools.com/XML/xml_cdata.asp

PCDATA is text that will be parsed by a parser. Tags inside the text will be treated as markup and entities will be expanded.
CDATA is text that will not be parsed by a parser. Tags inside the text will not be treated as markup and entities will not be expanded.
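A quick way to see the "entities will be expanded" difference is to put the same references in both kinds of text. This is only a sketch with Python's xml.etree.ElementTree; the element name p and the sample string are invented for the example.

```python
import xml.etree.ElementTree as ET

# In PCDATA, the entity reference and the character reference are expanded.
parsed = ET.fromstring("<p>AT&amp;T &#169; 2019</p>").text

# Inside a CDATA section the very same characters are kept literally.
literal = ET.fromstring("<p><![CDATA[AT&amp;T &#169; 2019]]></p>").text

print(parsed)
print(literal)
```

The first value comes back with a real ampersand and copyright sign, while the second still contains the characters &amp; and &#169; exactly as typed.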

By default, everything is PCDATA. In the following example, ignoring the root, bar will be parsed, and it'll have no text content, but one child.

<?xml version="1.0"?>
<foo>
<bar><test>content!</test></bar>
</foo>

When we want to specify that an element will only contain text, and no child elements, we use the keyword #PCDATA in its DTD declaration, because this keyword specifies that the element must contain parsable character data, that is, text in which a literal less-than (<) or ampersand (&) is written as an entity reference (&lt; or &amp;). Greater-than (>), apostrophe (') and double quote (") may be escaped as well, but in element content they are normally legal as-is.

In the next example, the content of bar is a CDATA section, which isn't parsed for markup, so bar has the text content "<test>content!</test>" and no children.

<?xml version="1.0"?>
<foo>
<bar><![CDATA[<test>content!</test>]]></bar>
</foo>

There are several content models in SGML. The #PCDATA content model says that an element may contain plain text. The "parsed" part of it means that markup (including PIs, comments and SGML directives) in it is parsed instead of displayed as raw text. It also means that entity references are replaced.

Another type of content model allowing plain text contents is CDATA. In XML, the element content model may not implicitly be set to CDATA, but in SGML, it means that markup and entity references are ignored in the contents of the element. In attributes of CDATA type however, entity references are replaced.

In XML, #PCDATA is the only plain text content model. You use it if you want to allow text content in the element at all. CDATA behaviour may be requested explicitly by putting a CDATA section inside #PCDATA content, but an element's content may not be declared as CDATA by default.

In a DTD, the type of an attribute that contains arbitrary text must be CDATA. The CDATA keyword in an attribute declaration has a different meaning than the CDATA section in an XML document. In a CDATA section all characters are legal (including the <, >, &, ' and " characters) except the "]]>" sequence, which ends the section.
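Because "]]>" cannot appear inside a CDATA section, text that contains it has to be split over two sections. A common workaround, sketched here in Python with a hypothetical helper called to_cdata (not part of any library), is to end the first section after "]]" and start a new one before ">". The script element and the payload are made up for the example.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every ']]>' across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# The payload happens to contain the forbidden ']]>' sequence.
payload = "if (a[b[0]]> c && d < e) {}"

# Naive wrapping: the section ends early and the rest is read as markup.
try:
    ET.fromstring("<script><![CDATA[" + payload + "]]></script>")
    naive_ok = True
except ET.ParseError:
    naive_ok = False

# Split wrapping: adjacent sections are joined back into one text value.
round_tripped = ET.fromstring("<script>" + to_cdata(payload) + "</script>").text

print(naive_ok)
print(round_tripped == payload)
```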

#PCDATA is not appropriate for the type of an attribute. It is used for the type of "leaf" text.
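As a rough illustration of where each keyword goes, the document below declares bar as a #PCDATA element with a CDATA attribute in an internal DTD subset. The element and attribute names (foo, bar, title) are invented. Note that Python's xml.etree.ElementTree, used here, reads the DTD but does not validate against it, so this only shows the declaration syntax and how the values come out of the parser.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE foo [
  <!ELEMENT foo (bar)>
  <!ELEMENT bar (#PCDATA)>
  <!ATTLIST bar title CDATA #IMPLIED>
]>
<foo><bar title="Q&amp;A">leaf text</bar></foo>"""

bar = ET.fromstring(doc).find("bar")

# The attribute is of type CDATA, yet its entity reference is still replaced.
title = bar.get("title")

# The element content is #PCDATA: plain "leaf" text.
text = bar.text

print(title)
print(text)
```

This matches the remark above that CDATA as an attribute type is not the same thing as a CDATA section: the reference in the attribute value is expanded to a plain ampersand.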

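#PCDATA is prefixed with a hash (also known as an octothorpe) for historical reasons: the convention comes from SGML, where the hash marks a reserved name so that it cannot be confused with an element that happens to be named PCDATA.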

Your first definition is correct.

PCDATA is parsed, which means that entities are expanded and that text is treated as markup. CDATA is not parsed by an XML parser.

If only script elements were set to CDATA by default in the XHTML DTDs, it would save a lot of ugly manual overrides... Why would script blocks contain other elements? If there are such elements, they are handled by the JS interpreter in DOM manipulation actions, in which case they should still be completely ignored by the XML parser before document insertion and rendering. I suppose it may have been designed to force the use of external script resource files, which is ultimately a good thing.

Licensed under: CC-BY-SA with attribution
