Question

I already posted an xml-utf16 question ("Emacs displays chinese character if I open xml file"), but now I would like to understand why this kind of problem arises. Maybe, with a deeper understanding, I can cope better with such problems.

Concretely, I got an xml file which was encoded with utf16. I opened the file on my Windows XP PC with Emacs (also Notepad and Firefox) and figure (A) was displayed (Firefox says: not well-formed). Obviously, the file was exported with encoding utf16. (B) shows the hexadecimal version. (C) shows the xml file after conversion to utf-8 with Emacs (revert-buffer-with-coding-system). I also converted the xml-utf16 file to utf8 with Perl. The result is shown in (D).

[Figure: screenshots (A)–(D) — (A) the file as displayed, (B) a hexdump, (C) after conversion in Emacs, (D) after conversion with Perl]

My questions:

  1. Obviously, the xml file was exported with encoding utf-16le. In my understanding, utf-16 is a simpler, older encoding than utf-8. Why does utf-8 not understand this encoding? And why do editors display Chinese characters?
  2. To read the content of the xml file, it was suggested to convert it with Emacs. What I get is not very readable (C) due to the "@". I thought that encoding issues are a common task and that editors like Emacs could cope with them. Am I wrong, or is this problem (the inserted "@") due to a bad specification of the xml file? And why is there a dot in the hexadecimal version between characters?
  3. I downloaded Perl code from the internet which converts utf16 to utf8. If I convert the original xml file to utf-8 I get figure (D). The good thing is that Firefox displays the tree structure of the new xml file. This is not the case in Emacs (D): the whole content is written on one line (with the exception of the first line). Indeed, the original file contains no CR or LF. If I want to see the utf16/utf8 xml file with its tree structure, it seems that it is my job to write Perl or Python code that inserts CR/LF (or to use an appropriate Perl/Python package), isn't it?
  4. Why does the exporter which produces the xml file under study not include LF/CR to make the xml file readable when opened in an editor? Is this to avoid large file sizes?
  5. There is a debate about utf16 (https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful). There is obviously concern about using utf16, and that question was asked about 4 years ago. Why do programmers still use utf16? Am I missing something? (I want to suggest to my data deliverers that they use utf8.)

Thanks for your patience.


Solution

There are various things you do not seem to know:

  • What is a character, and what is an encoding?
  • What is Unicode?
  • What are various Unicode encodings, how do they differ, what are their strengths and weaknesses, and what is their history?
  • What does the XML spec say about encodings?
  • How do various operating systems interact with encodings?
  • How can binary data be represented visually?
  • What does whitespace in XML do?

Basics

This will be just a link to “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky.

TL;DR: Encodings are bijective partial functions that map byte sequences to character sequences and back again. Unicode is a large list of characters which each have a number (the codepoint). Various encodings are used to map these codepoints to bytes:

  • ASCII, which can only represent 128 different characters.
  • UTF-16, which uses at least two bytes for each codepoint. This can include null bytes. This encoding is ambiguous: in which order are the bytes read? The byte order marks 0xFEFF and 0xFFFE sort this out, and one of them normally precedes a UTF-16 document.
  • UTF-8 uses at least one byte for each character, and has the property that ASCII is a subset of UTF-8. It cannot include null bytes (well, except for actual NULs). This encoding has the disadvantage that very high codepoints have large representations. CJK texts can be represented with fewer bytes in UTF-16 than in UTF-8; with Western texts, it is the other way round (see the sketch just below this list).
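
To make the size trade-off concrete, here is a minimal Perl sketch using the core Encode module (the three characters are arbitrary examples, not taken from your file):

    use strict;
    use warnings;
    use Encode qw(encode);

    # Encoded size of an ASCII, a European and a CJK character.
    for my $char ("A", "\x{20AC}", "\x{4E2D}") {      # 'A', the Euro sign, a CJK ideograph
        printf "U+%04X  UTF-8: %d bytes  UTF-16LE: %d bytes\n",
            ord($char),
            length(encode('UTF-8',    $char)),
            length(encode('UTF-16LE', $char));
    }

This prints 1 vs. 2 bytes for 'A', and 3 vs. 2 bytes for the other two characters.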

Visual representation of binary data

Some characters (“control characters”) have no printable interpretation. In your hexdump, unprintable bytes are represented with a dot (.). Emacs and Vim follow the traditional route of prefixing control codes with ^, which means that ^ together with the next character represents a control code. ^@ means the NUL character, ^H represents the backspace, and ^D represents the end of a transmission. You get the ASCII value of the control character by subtracting 0x40 from the ASCII character in the visual representation. \377 is the octal representation of 0xFF.
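
A tiny Perl sketch of that arithmetic (the three control codes are just the ones mentioned above):

    use strict;
    use warnings;

    # NUL (^@), backspace (^H) and end-of-transmission (^D):
    for my $caret (qw(@ H D)) {
        printf "^%s is control code 0x%02X\n", $caret, ord($caret) - 0x40;
    }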

XML and encodings

The default encoding for XML is UTF-8, because it is backwards-compatible with ASCII. Using any other encoding is unnecessary pain, as this question demonstrates. That said, UTF-16 can be used if it is properly declared, which your input tries to do, but the file then gets messed up.

The problem with your input

Your file has the following parts:

  • The BOM, seen in the dump as the bytes FF FE, which means the low byte comes first (little-endian). Each ASCII character is then followed by a NUL byte.
  • The first line of your input (up to byte 0x52 in your hexdump) includes the XML declaration, properly encoded.
  • Then, something bad happens: we get the sequence 0d00 0d0a. 0d00 is CR, the carriage return. The second part was meant to be 0a00, the line feed; together they would form a Windows line ending. Instead we get 0d0a, which is an ASCII-level CRLF, and that is wrong, because UTF-16 is a two-byte encoding.
  • After that, UTF-16 continues, but now the NUL precedes each character: the other UTF-16 byte order! But your editor does not know this, and shows you beautiful Chinese characters.

What happened:

  1. Someone printed out the XML preamble, which was encoded in UTF-16le and ended in \r\n. The trailing 0a byte (the low byte of the \n) was automatically translated to 0d 0a. So 0d00 0a00 became 0d00 0d0a 00.

    This can happen in Perl when you don't decode your input but print already-encoded bytes to your output. On Windows, Perl does automatic newline translation; this can be switched off via binmode $fh (see the sketch after this list).

  2. The rest of the document was printed out in a single line, so no further translations happened. Because the addition of a single byte shifted everything by one, the interpretation changed drastically.
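
Here is a minimal Perl sketch of that mechanism; the declaration text is made up, and the substitution merely imitates what byte-level newline translation (the :crlf layer) does to already-encoded output:

    use strict;
    use warnings;
    use Encode qw(encode);

    my $preamble = qq{<?xml version="1.0" encoding="utf-16"?>\r\n};
    my $bytes    = encode('UTF-16LE', $preamble);    # ends in ... 0d 00 0a 00

    # Byte-level newline translation: every 0x0A byte becomes 0x0D 0x0A.
    (my $mangled = $bytes) =~ s/\x0a/\x0d\x0a/g;     # now ends in ... 0d 00 0d 0a 00

    printf "before: %s\n", unpack 'H*', substr $bytes,   -8;   # 3f003e000d000a00
    printf "after:  %s\n", unpack 'H*', substr $mangled, -9;   # 3f003e000d000d0a00

The trailing bytes reproduce the 0d00 0d0a pattern from your hexdump: one stray byte, and every following character is read with the wrong byte pairing.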

If your script could fix this error, then it made the same mistake in reverse (translating \r\n to \n, and then decoding it).

Such errors can be avoided by decoding all input directly, and encoding it again before you print. Internally, always operate on codepoints, not bytes. In Perl, encodings can be added to a filehandle with binmode, which performs the decoding and encoding transparently.
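
A sketch of that approach, with made-up file names and assuming the input really is UTF-16LE with a BOM (the :raw layer keeps the default Windows newline translation away from the UTF-16 bytes; the :crlf layer on top then works on decoded characters):

    use strict;
    use warnings;

    # Decode on read, encode on write; work with characters in between.
    open my $in,  '<:raw:encoding(UTF-16LE):crlf', 'export.xml'      or die "open: $!";
    open my $out, '>:encoding(UTF-8)',             'export-utf8.xml' or die "open: $!";

    while (my $line = <$in>) {
        $line =~ s/\A\x{FEFF}//;     # drop the BOM character; UTF-8 does not need it
        print {$out} $line;
    }

    close $in;
    close $out or die "close: $!";

If the consumer expects Windows line endings, add :crlf to the output layers as well.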

Other tips

Why does utf-8 not understand this encoding?

Huh? UTF-8 is an encoding. It doesn't understand encodings. Your editor is what understands encodings, and its handling of UTF-8, UTF-16le and UTF-16be is not necessarily related.

And why do editors display Chinese characters?

The problem in (A) is that your editor is using UTF-16be to decode a document encoded using UTF-16le.
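
You can reproduce that effect with a few lines of Perl (the string "<?xml" is just an example):

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Encode with one byte order, decode with the other:
    my $bytes = encode('UTF-16LE', '<?xml');
    my $wrong = decode('UTF-16BE', $bytes);

    printf "U+%04X ", ord $_ for split //, $wrong;   # U+3C00 U+3F00 U+7800 U+6D00 U+6C00
    print "\n";

Those codepoints all sit in the CJK ranges, which is why the mis-decoded file looks like Chinese text.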

Am I wrong, or is this problem (the inserted "@") due to a bad specification of the xml file?

The document is correct. It uses UTF-16le, and it declares that by using encoding="utf-16" together with a BOM.

The problem in (C) is that your editor is using a single-byte encoding to decode a document encoded using UTF-16le. ^@ represents a NUL.

And why is there a dot in the hexadecimal version between characters?

There isn't. The right-most column displays the content of the file decoded using US-ASCII, which this obviously isn't.

If I want to see the utf16/utf8 xml file with its tree structure

... then you'll need an XML viewer/editor that displays XML in that form, not a text editor.
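
If you just want an indented text version, an XML-aware tool can re-flow it for you instead of you inserting CR/LF by hand; for example xmllint --format, or a short Perl sketch using the CPAN module XML::Twig (the file name is made up, and the module has to be installed):

    use strict;
    use warnings;
    use XML::Twig;

    # Parse the converted file and print it back with indentation.
    my $twig = XML::Twig->new(pretty_print => 'indented');
    $twig->parsefile('export-utf8.xml');
    $twig->print;                      # writes the re-indented XML to STDOUT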

Why does the exporter which produces the xml file under study not include LF/CR to make the xml file readable when opened in an editor?

It's simpler. Whitespace between elements is not needed for the XML to be well-formed, so many exporters simply don't add any.

There is a debate about utf16

The exact same problem happens a level up with UTF-8 because people don't handle graphemes correctly. If you handle graphemes correctly, the "problem" with UTF-16 goes away.

As such, rejecting UTF-16 on the basis that it's a variable-width encoding that few expect it to be makes no sense to me, because the same is true of UTF-8.
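
To see why code units are not the whole story in any Unicode encoding, here is a small Perl sketch (core regex only; the string is an arbitrary example): one user-perceived character, a grapheme, can consist of several codepoints:

    use strict;
    use warnings;

    my $str = "e\x{0301}";     # 'e' + COMBINING ACUTE ACCENT, displayed as a single "é"

    printf "codepoints: %d\n", length $str;          # 2
    my @graphemes = $str =~ /\X/g;
    printf "graphemes:  %d\n", scalar @graphemes;    # 1 (\X matches one grapheme cluster)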

Why do programmers still use utf16? Am I missing something?

It's what Windows uses internally.

I want to suggest to my data deliverers that they use utf8

That seems like a rather drastic solution to you incorrectly using UTF-16be instead of UTF-16le in Emacs.
