Question

Why does UTF-16 have a reserved range in the UCS database?

UTF-16 is just a way to represent a character's scalar value using one or two unsigned 16-bit code units. The layout of these code units shouldn't need to be related to the scalar value itself, because an algorithm is applied to recover the actual scalar value from the representation anyway.

Let's assume that the ranges D800-DBFF and DC00-DFFF were not reserved in the UCS database, and that there were an alternative UTF-16-like representation: every character in the range 0-7FFF fits in a single unsigned 16-bit unit, and when the high-order bit is set, another 16-bit unit follows carrying the remaining bits. For the byte order mark we would reserve the two possible values, and that's it.
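Roughly, the scheme I have in mind would look like this (a quick sketch in Python; the function names are my own, and I'm ignoring the BOM values for brevity):

```python
def encode(cp):
    # Scalar values 0x0000-0x7FFF fit in a single unit; anything larger sets
    # the high bit of the first unit and spills the remaining bits into a
    # second unit (15 + 16 = 31 payload bits in total).
    if cp <= 0x7FFF:
        return [cp]
    return [0x8000 | (cp >> 16), cp & 0xFFFF]

def decode(units):
    it = iter(units)
    for u in it:
        if u & 0x8000:                      # high bit set: a second unit follows
            yield ((u & 0x7FFF) << 16) | next(it)
        else:                               # standalone unit
            yield u
```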

If I'm wrong, could you explain why?

Thanks

Solution

One problem is that your proposed scheme is less efficient than the current surrogate-pair scheme.

Currently, only the 2048 code units 0xD800-0xDFFF are "out of bounds" as ordinary characters, leaving 63488 code units that map directly to single code points. Under your proposal, the 32768 code units 0x8000-0xFFFF are reserved for multi-code-unit sequences, leaving only the other 32768 for single-code-unit code points.
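To make the arithmetic concrete (just a back-of-the-envelope check):

```python
# Code units that can stand alone as a complete code point, per scheme:
utf16_single    = 0x10000 - 0x0800   # 65536 total - 2048 surrogates = 63488
proposed_single = 0x8000             # only 0x0000-0x7FFF           = 32768
```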

I don't know how many code points are currently assigned in the Basic Multilingual Plane, but I wouldn't be surprised if it were more than 32768, and of course the number can grow. As soon as it exceeds 32768, more characters would require two code units under your proposal than under UTF-16 as it stands.

Now I agree that none of this requires UCS to include a reserved range (and it's an ugly mix of meanings, in some ways) - but doing so makes it simple (in code) to map UTF-16 to UCS, while still maintaining a pretty efficient solution.
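For illustration, the mapping is only a shift, an OR and an offset; this is a minimal sketch assuming well-formed input (the constants are the standard UTF-16 ones):

```python
def utf16_to_scalar(units):
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:            # lead surrogate: combine with trail
            trail = next(it)                 # assumes well-formed input
            yield 0x10000 + (((u - 0xD800) << 10) | (trail - 0xDC00))
        else:                                # any other BMP unit maps to itself
            yield u

# e.g. list(utf16_to_scalar([0xD801, 0xDC00])) == [0x10400]
```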

There are very few downsides to this - there's plenty of space in the UCS, so reserving this small block doesn't significantly reduce the room for future expansion.

Supposition

This bit is an informed guess. You could do the research to find out which characters were used in which versions of Unicode, but I believe it's at least a plausible explanation.

The true reason for this particular block being used is probably historical - for a long time Unicode really was just 16-bit, for everything... and characters were already assigned in the upper ranges (the parts your scheme deems off-limits). By taking a block of 2048 values which weren't previously assigned, all previous valid UCS-2 sequences were preserved as valid UTF-16 sequences with the same meaning, while extending the UCS range beyond the BMP. It's possible that some aspects might be easier if the range had been 0xF800-0xFFFF, but it was too late by then.

OTHER TIPS

Code points D800-DFFF are reserved because they cannot be represented as themselves in the current UTF-16 encoding scheme. Since they fall within the 0000-FFFF range, they would otherwise be encoded as-is using one UTF-16 code unit. If that were allowed, a processor decoding/seeking forwards through a UTF-16 sequence that encounters a code unit in the D800-DBFF range would have to decide whether it represents a standalone code point or the start of a surrogate pair, and the only way to decide would be to look at the next code unit to see whether it is in the DC00-DFFF range. Similarly, when decoding/seeking backwards through a sequence, on encountering a code unit in the DC00-DFFF range it would have to look at the preceding code unit to see whether it is in the D800-DBFF range. That makes decoding/seeking harder and more error prone.
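To illustrate (a small sketch, with function names of my own), the reservation means every code unit classifies itself in isolation, so a decoder can resynchronize from any position:

```python
def unit_kind(u):
    # No neighbour needs to be examined, because 0xD800-0xDFFF never
    # appear in the stream as ordinary characters.
    if 0xD800 <= u <= 0xDBFF:
        return "lead surrogate"      # always the first half of a pair
    if 0xDC00 <= u <= 0xDFFF:
        return "trail surrogate"     # always the second half of a pair
    return "standalone code point"

def codepoint_start(units, i):
    # Seeking backwards: step back at most one unit to reach a boundary.
    return i - 1 if 0xDC00 <= units[i] <= 0xDFFF else i
```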

Un-reserving code points D800-DFFF for actual character use would require a logic change to the UTF-16 encoding scheme to escape those specific code points in a different manner that does not cause ambiguity. Under the current encoding scheme, such a change is not possible, AFAIK, so they remain permanently reserved.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow