Question

IDE: Embarcadero C++Builder XE5.

I'm trying to dump UnicodeStrings in XML CData sections.

Small extract of such a string:

 u"‰PNG\r\n\x1A\n\0\0\0\rIHDR\0\0\0õ\0\0\02\b\x06\0\0\0„\\i\0\0\0\x01sRGB\0®Î\x1Cé\0\0\0\x04gAMA\0\0±\vüa\x05\0\0\0\tpHYs\0\0\x0EÃ\0\0\x0EÃ\x01Ço¨d\0\0\v¼IDATxÚíœypUÕ\x19ÀO\x06…°¤\x04D$ˆ²\b1š\b\x18@...etc"

I know an XML document can contain non-ASCII characters, and I thought the content of an XML CData section is not parsed by the XML parser (with the exception of the end-of-section indicator "]]>", which I checked is not present in my data).

When creating (writing) a CData section, I still get the "an invalid character was found in text content when creating node" error.

Code example:

_di_IXMLDocument pXMLDocument = NewXMLDocument("1.0");
// I've played around with the document encoding with no success, guessing it's only applicable while reading the document.
// pXMLDocument->SetEncoding(L"iso-8859-1"); 

String myString;   // Unicode, contains my data string.

// 1st param of CreateNode method is of type UnicodeString.
_di_IXMLNode pCDataNode = pXMLDocument->CreateNode( myString, ntCData ); 

Any thoughts on why this is failing? Encoding problem?


Solution

If you read Section 2.7 of the XML specification, it describes the format of a CDATA section:

CDATA Sections

[18]    CDSect    ::=    CDStart CData CDEnd  
[19]    CDStart    ::=    '<![CDATA[' 
[20]    CData    ::=    (Char* - (Char* ']]>' Char*))  
[21]    CDEnd    ::=    ']]>' 

Char is defined in Section 2.2:

Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ 

If you look at your raw data, it contains over a dozen character values that are excluded from that range (specifically #x0, #x1, #x2, #x4, #x5, #x6, #x8, #xB, #xE, #x18, #x19, #x1A, and #x1C). That is why you are getting errors about illegal characters: you really do have illegal characters.
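
To see which positions in a string fall outside the Char production, a check along the following lines could be used. This is only a sketch in standard C++ (not VCL code); the names isXmlBmpChar and findInvalidXmlChars are made up for illustration, and it assumes the text is held as UTF-16 code units in a std::u16string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if a single UTF-16 code unit is allowed by the XML 1.0 Char production.
// Surrogate code units are rejected here; valid pairs are handled by the caller.
static bool isXmlBmpChar(char16_t c) {
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD);
}

// Returns the indexes of all code units that are not part of a valid XML Char.
// (Hypothetical helper, not part of any XML library.)
std::vector<std::size_t> findInvalidXmlChars(const std::u16string& s) {
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool high = c >= 0xD800 && c <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // well-formed surrogate pair, i.e. a character in #x10000-#x10FFFF
            continue;
        }
        if (!isXmlBmpChar(c)) {
            bad.push_back(i);
        }
    }
    return bad;
}
```

Run against the start of the string from the question, this would flag the #x1A and #x0 code units while accepting \r and \n.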

A CDATA section does not give you permission to put arbitrary binary data into an XML document. A CDATA section is meant to be used when text content contains characters that are normally reserved for XML markup, so that they do not have to be escaped or encoded as entities. The only way to put binary data into an XML document is to encode it in an XML-compatible (typically 7-bit ASCII) format, such as Base64 (but there are other formats available that you can use, such as yEnc).
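
As an illustration of what Base64 encoding does to the bytes, here is a minimal sketch in standard C++. The function name base64Encode is made up for this example, and it assumes the binary data is available as raw bytes rather than as a UnicodeString; in C++Builder you would more likely call the RTL's own Base64 routine instead of writing one.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encodes raw bytes using the standard Base64 alphabet with '=' padding.
// (Illustrative helper, not a RAD Studio function.)
std::string base64Encode(const std::vector<std::uint8_t>& data) {
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((data.size() + 2) / 3) * 4);

    std::size_t i = 0;
    // Every full group of 3 bytes becomes 4 output characters.
    for (; i + 2 < data.size(); i += 3) {
        const std::uint32_t n = (std::uint32_t(data[i]) << 16) |
                                (std::uint32_t(data[i + 1]) << 8) |
                                std::uint32_t(data[i + 2]);
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += kAlphabet[(n >> 6) & 63];
        out += kAlphabet[n & 63];
    }
    // One or two leftover bytes are padded with '='.
    if (i + 1 == data.size()) {
        const std::uint32_t n = std::uint32_t(data[i]) << 16;
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += "==";
    } else if (i + 2 == data.size()) {
        const std::uint32_t n = (std::uint32_t(data[i]) << 16) |
                                (std::uint32_t(data[i + 1]) << 8);
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += kAlphabet[(n >> 6) & 63];
        out += '=';
    }
    return out;
}
```

The output uses only A-Z, a-z, 0-9, '+', '/' and '=', so it contains no control characters and can never contain "]]>". The reader of the XML then has to Base64-decode the text to get the original bytes back.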

OTHER TIPS

Turns out the problem was indeed all the control characters (shown as escape sequences above) present in the raw data string, as suspected.

Solved that by Base64-encoding the entire string before creating the XML CData-sections.

Rad Studio methods: EncodeBase64, DecodeBase64

Header: Soap.EncdDecd.hpp

For my situation I created a function to trim a string to just the set of valid XML Characters.

Code (Delphi):

//Code released into public domain. No attribution required.
function TrimToXmlText(const xmlText: string): string;
var
   i, o: Integer;
begin
   {
      http://www.w3.org/TR/xml/#NT-Char

      Regardless of entity encoding, the only valid characters allowed are:

         Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

      I.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
      This means that a string such as

         "Line one"#31#10"Line two"

      is invalid (because of the #31 aka 0x1F).

      This means we need to manually strip them out; because the xml library certainly won't do it for us.

      Note: Delphi strings are UTF-16, so characters in [#x10000-#x10FFFF] are stored
      as surrogate pairs, and this simple per-element check removes those as well.
   }

   //Worst case: every character is kept
   SetLength(Result, Length(xmlText));

   o := 0;
   for i := 1 to Length(xmlText) do
   begin
      case Ord(xmlText[i]) of
      $9, $A, $D,
      $20..$D7FF,
      $E000..$FFFD:
         begin
            o := o+1;
            Result[o] := xmlText[i];
         end;
      end;
   end;

   //Shrink to the number of characters actually kept
   SetLength(Result, o);
end;
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow