XML Invalid characters when creating CData node from UnicodeString

Question 1

Section 2.7 of the XML specification describes the format of a CDATA section:

CDATA Sections

[18]    CDSect    ::=    CDStart CData CDEnd  
[19]    CDStart    ::=    '<![CDATA[' 
[20]    CData    ::=    (Char* - (Char* ']]>' Char*))  
[21]    CDEnd    ::=    ']]>'
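
One practical consequence of production [20] is that the text inside a CDATA section may not contain the sequence ]]>. A common workaround is to split the text across two adjacent CDATA sections at that point. The sketch below illustrates this in standard C++. The function name wrapInCData is hypothetical, and the function only handles the terminator; it does not check that every character satisfies Char.

```cpp
#include <cstddef>
#include <string>

// Wrap text in one or more CDATA sections.
// Production [20] forbids "]]>" inside CData, so each occurrence is split:
// "]]" stays at the end of one section and ">" starts the next one.
// Assumption: the caller has already removed characters that are not
// legal XML Chars; this function does not validate them.
std::string wrapInCData(const std::string& text) {
    const std::string terminator = "]]>";
    std::string out = "<![CDATA[";
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = text.find(terminator, pos);
        if (hit == std::string::npos) {
            out.append(text, pos, std::string::npos);  // copy the remainder
            break;
        }
        out.append(text, pos, hit - pos + 2);  // up to and including "]]"
        out += "]]><![CDATA[";                 // close one section, open the next
        pos = hit + 2;                         // continue at the ">"
    }
    out += "]]>";
    return out;
}
```

For example, the input x]]>y becomes <![CDATA[x]]]]><![CDATA[>y]]>, which a parser reads back as the original five characters.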

Char is defined in Section 2.2:

Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Your raw data contains over a dozen character values that fall outside that range (specifically #x0, #x1, #x2, #x4, #x5, #x6, #x8, #xB, #xE, #x18, #x19, #x1A, and #x1C). You are getting errors about illegal characters because the data really does contain illegal characters.
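
To see which positions in a string violate the Char production, a check along the following lines can help. This is a minimal sketch in standard C++ that assumes the text is held as UTF-16 code units (as in a UnicodeString); the names isXmlChar and findIllegalXmlUnits are hypothetical and not part of any library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the code point matches the XML 1.0 Char production.
bool isXmlChar(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Returns the indices of UTF-16 code units that XML does not allow.
// A high surrogate followed by a low surrogate encodes a code point in
// #x10000-#x10FFFF, which is legal, so such pairs are skipped.
// Unpaired surrogates are reported as illegal.
std::vector<std::size_t> findIllegalXmlUnits(const std::u16string& s) {
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool isHigh = u >= 0xD800 && u <= 0xDBFF;
        if (isHigh && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // valid surrogate pair: skip the low half too
            continue;
        }
        if (!isXmlChar(u)) {
            bad.push_back(i);
        }
    }
    return bad;
}
```

Running this over the raw data before building the document shows exactly where the offending values sit.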

A CDATA section does not give you permission to put arbitrary binary data into an XML document. A CDATA section is meant to be used when text content contains characters that are normally reserved for XML markup, so that they do not have to be escaped or encoded as entities. The only way to put binary data into an XML document is to encode it in an XML-compatible (typically 7-bit ASCII) format, such as Base64 or hexadecimal (the two binary encodings that XML Schema defines, as xs:base64Binary and xs:hexBinary).

Question 2

It turns out the problem was indeed all the control characters present in the raw data string, as suspected.

Solved that by Base64-encoding the entire string before creating the XML CData-sections.

RAD Studio methods: EncodeBase64, DecodeBase64

Header: Soap.EncdDecd.hpp
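
For readers without RAD Studio, the same round trip can be sketched in standard C++. The functions base64Encode and base64Decode below are hypothetical stand-ins written for illustration; they are not the Soap.EncdDecd routines, and the decoder simply skips characters outside the Base64 alphabet rather than reporting them.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

static const char kB64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode raw bytes as Base64 text with '=' padding.
std::string base64Encode(const std::vector<std::uint8_t>& in) {
    std::string out;
    out.reserve((in.size() + 2) / 3 * 4);
    std::size_t i = 0;
    for (; i + 2 < in.size(); i += 3) {  // full 3-byte groups
        const std::uint32_t n = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
        out += kB64[(n >> 18) & 63];
        out += kB64[(n >> 12) & 63];
        out += kB64[(n >> 6) & 63];
        out += kB64[n & 63];
    }
    if (i + 1 == in.size()) {            // one byte left over
        const std::uint32_t n = in[i] << 16;
        out += kB64[(n >> 18) & 63];
        out += kB64[(n >> 12) & 63];
        out += "==";
    } else if (i + 2 == in.size()) {     // two bytes left over
        const std::uint32_t n = (in[i] << 16) | (in[i + 1] << 8);
        out += kB64[(n >> 18) & 63];
        out += kB64[(n >> 12) & 63];
        out += kB64[(n >> 6) & 63];
        out += '=';
    }
    return out;
}

// Decode Base64 text back to raw bytes. Stops at the first '=' and
// skips any character that is not in the alphabet (e.g. line breaks).
std::vector<std::uint8_t> base64Decode(const std::string& in) {
    std::vector<std::uint8_t> out;
    std::uint32_t buf = 0;
    int bits = 0;
    for (const char c : in) {
        if (c == '=') break;
        const char* p = (c == '\0') ? nullptr : std::strchr(kB64, c);
        if (p == nullptr) continue;
        buf = (buf << 6) | static_cast<std::uint32_t>(p - kB64);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((buf >> bits) & 0xFF));
        }
    }
    return out;
}
```

The encoded output uses only letters, digits, '+', '/' and '=', so it can be placed in a CDATA section or in ordinary element content without any further escaping.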

Question 3

In my case, I wrote a function that strips from a string every character that is not a valid XML character.

Delphi:

//Code released into public domain. No attribution required.
function TrimToXmlText(const xmlText: string): string;
var
   i, o: Integer;
begin
   {
      http://www.w3.org/TR/xml/#NT-Char

      Regardless of entity encoding, the only valid characters allowed are:

         Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

      I.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
      This means that a string such as

         'Line one'#31#10'Line two'

      is invalid (because of the #31 aka 0x1F).

      This means we need to manually strip them out; because the xml library certainly won't do it for us.

      Note: Delphi strings are UTF-16, so characters in [#x10000-#x10FFFF] are
      stored as surrogate pairs. This routine drops all surrogate code units,
      which means those characters are removed as well.
   }

   // Allocate the worst case up front; the result is shortened at the end.
   SetLength(Result, Length(xmlText));

   o := 0;
   for i := 1 to Length(xmlText) do
   begin
      // Copy only the code units that fall inside the allowed ranges.
      case Ord(xmlText[i]) of
      $9, $A, $D,
      $20..$D7FF,
      $E000..$FFFD:
         begin
            Inc(o);
            Result[o] := xmlText[i];
         end;
      end;
   end;

   SetLength(Result, o);
end;
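
Because the question is about C++Builder, here is a rough C++ counterpart of the same filter. It is a sketch under stated assumptions: it uses std::u16string as a stand-in for UnicodeString, the name trimToXmlText is hypothetical, and it differs from the routine above in one respect, namely that well-formed surrogate pairs are kept so that characters in [#x10000-#x10FFFF] survive.

```cpp
#include <cstddef>
#include <string>

// Remove every UTF-16 code unit that is not allowed by the XML Char
// production. Well-formed surrogate pairs are kept because they encode
// code points in #x10000-#x10FFFF; unpaired surrogates are dropped.
std::u16string trimToXmlText(const std::u16string& s) {
    std::u16string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool isHigh = u >= 0xD800 && u <= 0xDBFF;
        if (isHigh && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            out += u;         // keep the high surrogate
            out += s[i + 1];  // and its low surrogate
            ++i;
            continue;
        }
        const bool allowed = u == 0x9 || u == 0xA || u == 0xD ||
                             (u >= 0x20 && u <= 0xD7FF) ||
                             (u >= 0xE000 && u <= 0xFFFD);
        if (allowed) {
            out += u;
        }
    }
    return out;
}
```

Note that stripping is lossy: the original bytes cannot be recovered afterwards, so this approach only suits data that is meant to be text. For genuine binary payloads, an encoding such as Base64 is the safer choice.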