There are various things you do not seem to know:
- What is a character, and what is an encoding?
- What is Unicode?
- What are various Unicode encodings, how do they differ, what are their strengths and weaknesses, and what is their history?
- What does the XML spec say about encodings?
- How do various operating systems interact with encodings?
- How can binary data be represented visually?
- What does whitespace in XML do?
- …
Basics
This will be just a link to “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky.
TL;DR: Encodings are bijective partial functions that map character sequences to byte sequences and back again. Unicode is a large list of characters, each of which has a number (its codepoint). Various encodings are used to map these codepoints to bytes:
- ASCII, which can only represent 128 different characters.
- UTF-16, which uses at least two bytes for each codepoint. These can include null bytes. The encoding is ambiguous: in which direction are the bytes read? The byte order marks `0xFEFF` or `0xFFFE` sort this out, and one of them precedes every UTF-16 document.
- UTF-8, which uses at least one byte for each codepoint, and has the property that ASCII is a subset of UTF-8. It cannot include null bytes (well, except for actual NULs). This encoding has the disadvantage that very high codepoints have large representations: CJK texts can be represented with fewer bytes in UTF-16 than in UTF-8, while with Western texts it is the other way round (see the sketch below).
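To make that size trade-off concrete, here is a minimal Perl sketch using the core Encode module; the sample strings are just illustrations:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# One ASCII string and one CJK string ("hello" and 日本語).
for my $str ("hello", "\x{65E5}\x{672C}\x{8A9E}") {
    my $utf8  = encode('UTF-8',    $str);   # byte string
    my $utf16 = encode('UTF-16LE', $str);   # byte string
    printf "%d codepoints: %2d bytes as UTF-8, %2d bytes as UTF-16LE\n",
           length($str), length($utf8), length($utf16);
}
# 5 codepoints:  5 bytes as UTF-8, 10 bytes as UTF-16LE
# 3 codepoints:  9 bytes as UTF-8,  6 bytes as UTF-16LE
```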
Visual representation of binary data
Some characters (“control characters”) have no printable interpretation. In your hexdump, unprintable bytes are represented with a `.`. Emacs and Vim follow the traditional route of prefixing control codes with `^`, which together with the next character represents a control code. `^@` means the NUL character, `^H` represents the backspace, and `^D` represents the end of a transmission. You get the ASCII value of a control character by subtracting `0x40` from the ASCII value of the character in its visual representation. `\377` is the octal representation of `0xFF`.
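As a quick illustration of that `0x40` rule, a few lines of Perl (the caret characters are the ones named above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Caret notation shows a control code as '^' plus the character
# whose ASCII value is 0x40 higher; subtract 0x40 to get it back.
for my $caret ('@', 'H', 'D') {
    printf "^%s is control code 0x%02X\n", $caret, ord($caret) - 0x40;
}
# ^@ -> 0x00 (NUL), ^H -> 0x08 (backspace), ^D -> 0x04 (end of transmission)
```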
XML and encodings
The default encoding for XML is UTF-8, because it is backwards-compatible with ASCII. Using any other encoding is unnecessary pain, as is evidenced by this question. Anyway, UTF-16 can be used if it is properly declared (which your input tries to do), but the file then gets messed up.
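For completeness, this is roughly what writing a properly declared UTF-16 document looks like in Perl; the file name is made up, and the `:raw:encoding(...)` layer stack is the usual idiom:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# :raw first, so no byte-level newline translation can touch the
# two-byte code units; the encoding layer then does the rest.
open my $out, '>:raw:encoding(UTF-16LE)', 'proper.xml' or die "open: $!";

print {$out} "\x{FEFF}";   # the BOM, encoded to the bytes FF FE
print {$out} qq{<?xml version="1.0" encoding="UTF-16"?>\n};
print {$out} qq{<root>text</root>\n};
close $out;
```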
The problem with your input
Your file has the following parts:
- The BOM `0xFFFE`, which means the first byte is the low byte in the input. Each ASCII character is then followed by a NUL byte.
- The first line of your input (up to byte 0x52 in your hexdump) contains the XML declaration, properly encoded.
- Then, something bad happens: we get the sequence `0d00 0d0a`. `0d00` is `CR`, the carriage return. The second part was meant to be `0a00`, the line feed; together, they form a Windows line ending. The `0d0a` would be an ASCII CRLF, but this is wrong here, because UTF-16 is a two-byte encoding.
- After that, UTF-16 continues, but now the NUL precedes each character: the other UTF-16 byte order! Your editor does not know this, and gives you beautiful Chinese characters.
What happened:
- Someone printed out the XML preamble, which was encoded in UTF-16le. The `\n` at the end was automatically translated to `\r\n`, so `0d00 0a00` became `0d00 0d0a 00`. This can happen in Perl when you don't decode your input, but encode your output: on Windows, Perl does automatic newline translation, which can be switched off via `binmode $fh`. A sketch reproducing this follows below.
- The rest of the document was printed out in a single line, so no further translations happened. But because the addition of a single byte shifted everything, the interpretation changed drastically.
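The corruption can be reproduced in a few lines; this sketch forces the `:crlf` layer (the Windows default) onto a handle and pushes already-encoded UTF-16le bytes through it (the file name is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# The preamble, already containing a CRLF line ending, encoded to bytes:
my $xml   = qq{<?xml version="1.0" encoding="UTF-16"?>\r\n<root/>};
my $bytes = "\xFF\xFE" . encode('UTF-16LE', $xml);   # BOM + body

# The :crlf layer works on bytes and rewrites every 0x0A to 0x0D 0x0A,
# oblivious to the fact that the 0x0A is half of a UTF-16 code unit.
open my $out, '>:crlf', 'broken.xml' or die "open: $!";
print {$out} $bytes;
close $out;
# In the file, "0d00 0a00" has become "0d00 0d0a 00": one extra byte,
# and every following character is read with its bytes swapped.
```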
If your script could fix this error, then it made the same mistake in reverse (translating `\r\n` to `\n`, and then decoding it).
Such errors can be avoided by decoding all input directly, and encoding it again before you print. Internally, always operate on codepoints, not bytes. In Perl, encodings can be added to a filehandle with `binmode`, which performs the de- and encoding transparently.
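Putting that advice into code, a safe round-trip could look like this sketch (the file names, and the UTF-16LE assumption, come from the input above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode directly on input: :raw suppresses byte-level newline mangling,
# :encoding(UTF-16LE) turns bytes into codepoints, and :crlf then maps
# the *decoded* CRLF to \n, where it is safe to do so.
open my $in, '<:raw:encoding(UTF-16LE):crlf', 'in.xml' or die "open: $!";
my $doc = do { local $/; <$in> };   # slurp the whole document
close $in;

$doc =~ s/\A\x{FEFF}//;             # drop the BOM, if present

# ... operate on $doc here: it holds codepoints, not bytes ...

# Encode again on output, with the same layer stack.
open my $out, '>:raw:encoding(UTF-16LE):crlf', 'out.xml' or die "open: $!";
print {$out} "\x{FEFF}", $doc;      # re-add the BOM
close $out;
```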