Unicode BOM for UTF-16LE vs UTF32-LE

https://stackoverflow.com/questions/1929962

20-09-2019
|

Question

It seems like there's an ambiguity between the Byte Order Marks used for UTF16-LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:

FF FE 00 00 00 00 00 00

How can I tell if this file contains:

The UTF16-LE BOM (FF FE) followed by 3 null characters; or
The UTF32-LE BOM (FF FE 00 00) followed by one null character?

Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?

Solution

As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first, then you can use the BOM to determine whether the least or most significant bytes are first for multibyte sequences.

A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.

OTHER TIPS

It is unambiguous. FF FE is for UTF-16LE, and FF FE 00 00 denotes UTF-32LE. There is no reason to think that FF FE 00 00 is possibly UTF-16LE because the UTFs were designed for text, and users shouldn't be using NUL characters in their text. After all, when was the last time you opened a hex editor and inserted a few bytes of 00 into a text document? ^_^

I have experienced the same problem like Edward. I agree with Dustin, usually one will not use null-characters in textfiles.

However i have created a file that contains all unicode characters. I have first used the utf-32le encoding, then a utf-32be encoding, a utf-16le and a utf-16be encoding as well as a utf-8 encoding.

When trying to re-encode the files to utf-8, i wanted to compare the result to the already existing utf-8 file. Because the first character in my files after the BOM is the null-character, i could not successfully detect the file with utf-16le BOM, it showed up as utf-32le BOM, because the bytes appeared exactly like Edward has described. The first character after the BOM FFFE is 0000, but the BOM detection found a BOM FFFE0000 and so, detected utf-32le instead of utf-16le whereby my first 0000-character was stolen and taken as part of the BOM.

So one should never use a null-character as first character of a file encoded with utf-16 little endian, because it will make the utf-16le and utf-32le BOM ambiguous.

To solve my problem, i will swap the first and second character. :-)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow