What is an Unusual Octet Order BOM

https://stackoverflow.com/questions/18518730

26-06-2022

Question

In the XML documentation and in the various implementations of the Mozilla Universal Character Set Detector (UCSD), there appear BOM sequences in which either the byte order or the word order is reversed, but not both. They are called 'unusual octet order':

XML docs:

F.1 Detection Without External Encoding Information
...
00 00 FF FE     UCS-4, unusual octet order (2143)
FE FF 00 00     UCS-4, unusual octet order (3412)

Universal Character Set Detector (UCSD) source (just an example):

  if (('\xFF' == aBuf[1]) && ('\x00' == aBuf[2]) && ('\x00' == aBuf[3]))
    // FE FF 00 00 UCS-4, unusual octet order BOM (3412)
    mDetectedCharset = "X-ISO-10646-UCS-4-3412";

  else if (('\x00' == aBuf[1]) && ('\xFF' == aBuf[2]) && ('\xFE' == aBuf[3]))
    // 00 00 FF FE UCS-4, unusual octet order BOM (2143)
    mDetectedCharset = "X-ISO-10646-UCS-4-2143";

Universal Character Set Detector (UCSD) docs:

Known character sets
...
X-ISO-10646-UCS-4-2143
X-ISO-10646-UCS-4-3412

Is there any hardware in existence that uses this endianness? Is there such an encoding, or an ISO standard for it? Are there any popular libraries that support encoding or decoding it? Why aren't these sequences simply ignored like any other invalid sequence?

Solution

ISO 10646 and Unicode define only big-endian and little-endian UCS-4/UTF-32, not middle-endian. To my knowledge, no software in existence uses these encodings; they are practically irrelevant. Why, then, does the XML standard mention them? I don't know, but I guess it was driven by a desire for theoretical completeness rather than any practical value; the same likely applies to character detection/conversion software that includes support for them.
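To make the four orderings concrete, here is a minimal C++17 sketch of how a detector or converter might handle them. It is not the UCSD implementation; the names `Ucs4Order`, `detectUcs4Bom`, and `decodeUcs4` are hypothetical. It assumes the digit convention of the XML table, where 1 denotes the most significant octet, so 1234 is big-endian and 4321 is little-endian.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical names for illustration. Digit 1 is the most significant octet,
// so 1234 is big-endian and 4321 is little-endian (the XML table's convention).
enum class Ucs4Order { O1234 = 0, O4321 = 1, O2143 = 2, O3412 = 3 };

// U+FEFF as it appears on the wire in each order, indexed by Ucs4Order.
static const unsigned char kBoms[4][4] = {
    {0x00, 0x00, 0xFE, 0xFF},  // 1234, big-endian
    {0xFF, 0xFE, 0x00, 0x00},  // 4321, little-endian
    {0x00, 0x00, 0xFF, 0xFE},  // 2143, unusual octet order
    {0xFE, 0xFF, 0x00, 0x00},  // 3412, unusual octet order
};

// kPos[order][k] is the input position of the (k+1)-th most significant octet.
static const int kPos[4][4] = {
    {0, 1, 2, 3},  // 1234
    {3, 2, 1, 0},  // 4321
    {1, 0, 3, 2},  // 2143
    {2, 3, 0, 1},  // 3412
};

// Return the octet order indicated by a leading UCS-4 BOM, if one is present.
std::optional<Ucs4Order> detectUcs4Bom(const unsigned char* buf, std::size_t len) {
    if (len < 4) return std::nullopt;
    for (int i = 0; i < 4; ++i) {
        if (std::memcmp(buf, kBoms[i], 4) == 0) return static_cast<Ucs4Order>(i);
    }
    return std::nullopt;
}

// Reassemble one code point from four octets stored in the given order.
std::uint32_t decodeUcs4(const unsigned char* p, Ucs4Order order) {
    const int* pos = kPos[static_cast<int>(order)];
    std::uint32_t value = 0;
    for (int k = 0; k < 4; ++k) {
        value = (value << 8) | p[pos[k]];  // append octets from most to least significant
    }
    return value;
}
```

Supporting the two unusual orders costs only two extra table rows, which may be part of why detectors include them despite the lack of practical use.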

Historically, there have been some systems which used a middle-endian byte order. PDP-11s stored 32-bit numbers with the high-order 16-bit word first and each word little-endian, a layout conventionally labelled 3412 when little-endian is written as 1234. In the XML table's numbering, where 1234 is big-endian, that same layout is 2143 (00 00 FF FE). So if you were to try to process UCS-4/UTF-32 on a PDP-11, the UCS-4-2143 format might be useful. But in practice, no one tries to do that: PDP-11s were past their heyday by the time Unicode arrived, and since PDP-11s are only 16-bit machines, UCS-4 is not the best Unicode format to use with them.

Licensed under: CC-BY-SA with attribution