Question

What I have so far is putting the text into CDATA tags, and dealing with the possibility of CDATA endings appearing in the text by splitting it into multiple adjacent CDATAs.
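For reference, the splitting approach described above might look like the following sketch. It is written in Python purely for illustration (the generator in the question is Perl), and `to_cdata` is a hypothetical helper name. Each `]]>` in the text is cut so that `]]` closes one section and `>` opens the next.

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, splitting any ']]>' across two adjacent sections.

    Hypothetical helper for illustration only.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


original = "before ]]> after\nsecond line"
document = "<record>" + to_cdata(original) + "</record>"

# The parser joins the adjacent CDATA sections back into one text value.
round_tripped = ET.fromstring(document).text
```

Note that this only handles the `]]>` terminator; it does nothing for characters that XML cannot represent at all.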

I'm not sure about this, but XML parsers can fail to preserve newlines inside of CDATA tags, correct? This would mean escaping them somehow as well...

I want to generate these XML files using Perl, and parse them with C++ (using expat), Java, and C#.

Most importantly, I want the resulting files to be somewhat human-readable/modifiable. Does anyone know of any encoding scheme that fits these needs? I am using this to store data for a database, so it needs to accept arbitrary text, and upon parsing return the exact same text.

Solution

XML already supports this. You do not need to do anything special, and you certainly do not need to use CDATA. Just use a decent library, make sure you are using UTF-8 encoding, and add a text node. If something is "losing" newlines, then it's a bug. XML already has an "encoding" (escaping) that is relatively human-readable. It's also standard, which makes it much more useful than inventing your own.

See, for example, https://stackoverflow.com/a/1140802/181772
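As a rough illustration of "just add a text node", here is a minimal sketch using Python's standard `xml.etree.ElementTree` (chosen only for brevity; the question's actual tools are Perl, expat, Java, and C#). The element name `record` is made up for the example. The library escapes `&`, `<`, and `>` on output, and the parser reverses that on input.

```python
import xml.etree.ElementTree as ET

# Text with markup characters, quotes, a CDATA terminator, and a newline.
original = 'line one\nline two <b>& "quotes"</b> ]]> done'

root = ET.Element("record")  # "record" is an arbitrary element name
root.text = original         # plain text node; no CDATA involved

xml_bytes = ET.tostring(root, encoding="utf-8")

# Parsing gives back the same string, newline included.
parsed_text = ET.fromstring(xml_bytes).text
```

The serialized form stays fairly readable: only the markup characters turn into entities such as `&lt;` and `&amp;`.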

OTHER TIPS

You could escape the content. For example, if the content were HTML:

<html>&lt;b&gt;Bold Text&lt;/b&gt;</html>

vs.

<html><![CDATA[<b>Bold Text</b>]]></html>
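The two forms above are equivalent as far as a parser is concerned. A quick check, sketched with Python's standard-library parser (used here only as a convenient stand-in for any conforming parser):

```python
import xml.etree.ElementTree as ET

# Entity-escaped form.
escaped_form = ET.fromstring("<html>&lt;b&gt;Bold Text&lt;/b&gt;</html>").text

# CDATA form.
cdata_form = ET.fromstring("<html><![CDATA[<b>Bold Text</b>]]></html>").text

# Both yield the same character data.
same = escaped_form == cdata_form
```

So the choice between them is mostly about how the file looks to a human editor, not about what the reading side receives.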

Hmm, as far as I can tell, CDATA sections are for character data, and control characters don't count. I assume this means that, on the matter of newlines, XML parsers make a judgement call about whether they are control characters or not (historically, yes, but practically... no).

While it would impair readability, you can encode newlines using escape sequences (character references such as &#10;). Assuming that you are escaping properly, the parser will convert them back to the original characters; you'll just have to remember to do it when encoding.
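One case where this matters in practice is the carriage return: XML parsers normalize a literal CR LF (or lone CR) in the document to a single LF, while a character reference such as `&#13;` survives. A small sketch, again in Python for illustration, with `escape_text` as a hypothetical helper:

```python
import xml.etree.ElementTree as ET


def escape_text(text):
    """Escape markup characters and carriage returns for element content.

    Hypothetical helper; '&' must be replaced first.
    """
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\r", "&#13;"))


original = "first\r\nsecond <x> & third"

# Written literally, the CR LF pair is normalized to LF by the parser.
literal_result = ET.fromstring("<r>first\r\nsecond</r>").text

# With the CR written as a character reference, the text round-trips.
escaped_result = ET.fromstring("<r>" + escape_text(original) + "</r>").text
```

Plain line feeds can stay as literal newlines, which keeps the file readable; only the carriage returns need the reference.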

Another option, which completely violates your "human-readable" requirement, is to Base64-encode the text; this allows you to encode arbitrary information in the XML.
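A minimal sketch of the Base64 route, in Python for illustration. The `record` element and its `encoding` attribute are invented for this example so that a reader knows how to decode the value. This approach also covers characters such as NUL that XML 1.0 cannot carry at all, even as character references.

```python
import base64
import xml.etree.ElementTree as ET

# Includes control characters that are not legal in an XML 1.0 document.
raw = "tab\there, NUL\x00, SOH\x01, and \r\n line ends"

payload = base64.b64encode(raw.encode("utf-8")).decode("ascii")

# The "encoding" attribute is a made-up convention, not part of XML.
root = ET.Element("record", {"encoding": "base64"})
root.text = payload
xml_bytes = ET.tostring(root, encoding="utf-8")

# Reading side: parse, then reverse the Base64 step.
parsed = ET.fromstring(xml_bytes)
decoded = base64.b64decode(parsed.text).decode("utf-8")
```

A possible compromise is to apply this only to values that contain such characters and store everything else as ordinary escaped text.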

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow