Question

I had read this great tutorial
http://www.joelonsoftware.com/articles/Unicode.html

But I didn't understand how UTF-8 solves the big-endian vs. little-endian machine problem. For a single byte, it's fine. But how does it work for multi-byte characters?

Can someone explain better?


Solution

Here is a link that explains UTF-8 in depth. http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

At the heart of it, UTF-16 is oriented around 16-bit short integers, while UTF-8 is byte oriented. Since architectures differ in how the bytes of a multi-byte datatype are ordered (big-endian or little-endian), the UTF-16 encoding can go either way. On all architectures I am aware of, there is no endianness below the byte level (at the nibble or semi-octet level): every byte is a sequence of 8 bits in a fixed order. Because UTF-8 is defined as a sequence of bytes, it has no endianness.

The Japanese character あ is a good example. It is U+3042 (binary=0011 0000 : 0100 0010).

  • UTF-16BE: 30, 42 = 0011 0000 : 0100 0010
  • UTF-16LE: 42, 30 = 0100 0010 : 0011 0000
  • UTF-8: e3, 81, 82 = 1110 0011 : 1000 0001 : 1000 0010
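
If you want to check these byte sequences yourself, here is a minimal sketch using Python's built-in codecs. The variable names are just for illustration.

```python
# Encode U+3042 (あ) three ways and look at the resulting bytes.
ch = "\u3042"

utf16_be = ch.encode("utf-16-be")  # 16-bit unit written high byte first
utf16_le = ch.encode("utf-16-le")  # same 16-bit unit written low byte first
utf8 = ch.encode("utf-8")          # byte sequence fixed by the UTF-8 definition

print(utf16_be.hex())  # 3042
print(utf16_le.hex())  # 4230
print(utf8.hex())      # e38182
```

The two UTF-16 results differ only in byte order, while UTF-8 has a single possible result.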

Here is some information on unicode あ

OTHER TIPS

There is no endianness problem with UTF-8. The problem arises with UTF-16, because a sequence of two-byte code units has to be written as a sequence of single bytes when it goes into a file or a communication stream, and the two ends may have different ideas about the byte order within a two-byte number. Because UTF-8 works at the byte level, there is no need for a BOM to parse the sequence correctly on both big-endian and little-endian machines. It does not matter that a character is multi-byte: UTF-8 defines exactly in what order the bytes come when a code point is encoded as multiple bytes.
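
To make that concrete, here is a rough sketch of the idea in Python. The helper `encode_utf8_3byte` is hypothetical (written only for this illustration, and it only handles the three-byte range U+0800..U+FFFF). It builds the bytes with shifts and masks, so the output order comes from the encoding rule and not from how the machine stores integers. The `struct` calls show the contrast: a 16-bit unit has two possible byte layouts.

```python
import struct


def encode_utf8_3byte(code_point: int) -> bytes:
    """Hypothetical helper: encode a code point in U+0800..U+FFFF as UTF-8."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers the three-byte range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    # The order lead, middle, last is part of the UTF-8 definition.
    lead = 0b1110_0000 | (code_point >> 12)             # top 4 bits
    middle = 0b1000_0000 | ((code_point >> 6) & 0x3F)   # next 6 bits
    last = 0b1000_0000 | (code_point & 0x3F)            # low 6 bits
    return bytes([lead, middle, last])


manual = encode_utf8_3byte(0x3042)

# A 16-bit unit can be serialized in two byte orders.
big = struct.pack(">H", 0x3042)     # b"\x30\x42"
little = struct.pack("<H", 0x3042)  # b"\x42\x30"

# Reading UTF-16 bytes with the wrong byte order yields a different character.
misread = big.decode("utf-16-le")   # U+4230 instead of U+3042
```

Shifts and masks operate on the integer's value, so `manual` comes out as e3 81 82 on any machine. The UTF-16 bytes need extra information (a BOM or an agreed byte order) to be read back correctly.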

The BOM in UTF-8 serves a completely different purpose (so the name 'Byte Order Mark' is a little off). It signals that "this is going to be a UTF-8 stream". The UTF-8 BOM is generally unpopular, and many programs do not support it correctly. The site utf8everywhere.org believes it should be deprecated in the future.
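
As a small illustration of what the UTF-8 BOM looks like in practice, here is a sketch using Python's standard `codecs` module and its `utf-8-sig` codec. The variable names are made up for the example.

```python
import codecs

text = "\u3042"  # あ

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with the BOM

print(codecs.BOM_UTF8.hex())  # efbbbf
print(with_bom.hex())         # efbbbfe38182

# The plain "utf-8" codec does not strip the BOM; it shows up as U+FEFF.
kept = with_bom.decode("utf-8")

# "utf-8-sig" strips a leading BOM if present and accepts input without one.
stripped = with_bom.decode("utf-8-sig")
also_fine = plain.decode("utf-8-sig")
```

The stray U+FEFF in `kept` is the kind of thing that trips up programs that do not expect a BOM in UTF-8 input.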

Licensed under: CC-BY-SA with attribution