Question

According to its language specification, JavaScript has some problems with Unicode (if I understand it correctly): internally, text is always handled as a sequence of characters that are 16 bits each.

JavaScript: The Good Parts makes a similar point.

When you search Google for V8's support of UTF-8, you get contradictory statements.

So: What is the state of Unicode support in Node.js (0.10.26 was the current version when this question was asked)? Does it handle UTF-8 with all possible code points correctly, or doesn't it?

If not: What are possible workarounds?

Was it helpful?

Solution

The two sources you cite, the language specification and Crockford's “JavaScript: The Good Parts” (page 103), say the same thing, although the latter says it much more concisely (and more clearly, if you already know the subject). For reference I'll cite Crockford:

JavaScript was designed at a time when Unicode was expected to have at most 65,536 characters. It has since grown to have a capacity of more than 1 million characters.

JavaScript's characters are 16 bits. That is enough to cover the original 65,536 (which is now known as the Basic Multilingual Plane). Each of the remaining million characters can be represented as a pair of characters. Unicode considers the pair to be a single character. JavaScript thinks the pair is two distinct characters.

The language specification calls the 16-bit unit a “character” and a “code unit”. A “Unicode character”, or “code point”, on the other hand, can (in rare cases) need two 16-bit “code units” to be represented.

All of JavaScript's string properties and methods, like length, substr(), etc., work with 16-bit “characters” (working with true 16/32-bit Unicode characters, i.e. UTF-16 characters, would be very inefficient). This means, for example, that if you are not careful, substr() can split a 32-bit UTF-16 Unicode character and leave one half of it on its own. JavaScript won't complain as long as you don't display it, and maybe won't even complain if you do. This is because, as the specification says, JavaScript does not check that the characters are valid UTF-16; it only assumes they are.
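
A minimal sketch of what this means in practice (any ES2015+ engine, including current Node.js, works; the specific emoji is only illustrative):

    // U+1F600 lies outside the Basic Multilingual Plane, so JavaScript
    // stores it as two 16-bit code units (a surrogate pair).
    const s = 'a\u{1F600}b';        // "a😀b": 3 Unicode code points

    console.log(s.length);          // 4: counts 16-bit code units, not code points
    console.log([...s].length);     // 3: the string iterator walks code points

    // Slicing by code units can cut the pair in half, leaving a lone surrogate:
    const broken = s.substring(0, 2);
    console.log(broken.charCodeAt(1).toString(16)); // "d83d": high surrogate only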

In your question you ask

Does [Node.js] handle UTF-8 with all possible code points correctly, or doesn't it?

Since all possible UTF-8 code points are converted to UTF-16 (as one or two 16-bit “characters”) on input, before anything else happens, and converted back on output, the answer depends on what you mean by “correctly”; but if you accept JavaScript's interpretation of that “correctly”, the answer is “yes”.
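
For example, a quick round trip through Node's built-in Buffer (the specific code point is only an illustration) shows UTF-8 bytes being decoded into a UTF-16 surrogate pair and encoded back without loss:

    // U+1D11E (MUSICAL SYMBOL G CLEF) needs 4 bytes in UTF-8 and a
    // surrogate pair (two 16-bit code units) in JavaScript's UTF-16 strings.
    const original = '\u{1D11E}';

    const utf8Bytes = Buffer.from(original, 'utf8');
    console.log(utf8Bytes);                 // <Buffer f0 9d 84 9e>

    const decoded = utf8Bytes.toString('utf8');
    console.log(decoded === original);      // true: the round trip is lossless
    console.log(decoded.length);            // 2: internally it is a surrogate pair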

For further reading and head-scratching: https://mathiasbynens.be/notes/javascript-unicode

Other tips

The JavaScript string type is UTF-16, so its Unicode support is 100%. All UTF forms support all Unicode code points.

Here is a general breakdown of the common forms (a short byte-count sketch follows the list):

  • UTF-8 - 8-bit code units; variable width (code points are 1-4 code units)
  • UTF-16 - 16-bit code units; variable width (code points are 1-2 code units); big-endian or little-endian
  • UTF-32 - 32-bit code units; fixed width; big-endian or little-endian
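
A rough sketch of those widths using Node's Buffer (Node ships 'utf8' and 'utf16le' codecs but has no UTF-32 codec, so that column is simply code points × 4):

    // Byte counts for single code points in each encoding form.
    const samples = ['A', '\u00E9', '\u20AC', '\u{1F600}']; // A, é, €, 😀

    for (const ch of samples) {
      const codePoints = [...ch].length;
      console.log(
        ch,
        'utf8:', Buffer.byteLength(ch, 'utf8'),      // 1, 2, 3, 4 bytes
        'utf16:', Buffer.byteLength(ch, 'utf16le'),  // 2, 2, 2, 4 bytes
        'utf32:', codePoints * 4                     // always 4 bytes per code point
      );
    }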

UTF-16 was popularized when it was thought every code point would fit in 16 bits. This turned out not to be the case. UTF-16 was later redesigned to allow code points to take two code units, and the old fixed-width 16-bit form is known as UCS-2.
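
A small sketch of that redesign: a code point above U+FFFF is split into a high and a low surrogate code unit (the helper name toSurrogatePair is just for illustration, not a standard API):

    // Encode a supplementary code point (> U+FFFF) as a UTF-16 surrogate pair.
    function toSurrogatePair(codePoint) {
      const offset = codePoint - 0x10000;
      const high = 0xD800 + (offset >> 10);    // high (lead) surrogate
      const low  = 0xDC00 + (offset & 0x3FF);  // low (trail) surrogate
      return [high, low];
    }

    const [hi, lo] = toSurrogatePair(0x1F600);                 // 😀
    console.log(hi.toString(16), lo.toString(16));             // 'd83d' 'de00'
    console.log(String.fromCharCode(hi, lo) === '\u{1F600}');  // true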

However, it turns out that visible widths do not map very well to memory storage units anyway, so both UTF-16 and UTF-32 are of limited utility. Natural language is complex, and in many cases sequences of code points combine in surprising ways.

The measurement of width for a "character" depends on context. Memory? Number of visible graphemes? Render width in pixels?
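
For instance, the same string can reasonably report several different sizes; a sketch assuming a Node.js version new enough to ship Intl.Segmenter (roughly v16+):

    // A family emoji: three emoji code points joined by ZERO WIDTH JOINERs.
    const s = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}';

    console.log(Buffer.byteLength(s, 'utf8'));  // 18: bytes on the wire as UTF-8
    console.log(s.length);                      // 8: 16-bit code units
    console.log([...s].length);                 // 5: Unicode code points

    const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
    console.log([...segmenter.segment(s)].length); // 1: visible grapheme cluster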

UTF-16 remains in common use because many of today's popular languages/environments (Java/JavaScript/Windows NT) were born in the '90s. It is not broken. However, UTF-8 is usually preferred.

If you are suffering from a data-loss/corruption issue it is usually because of a defect in a transcoder or the misuse of one.
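
A sketch of that kind of misuse, assuming Node's Buffer: an unpaired surrogate is not valid Unicode, so the UTF-8 transcoder typically substitutes U+FFFD REPLACEMENT CHARACTER and the original data is gone:

    // Keep only the high surrogate of U+1F600, producing an invalid string.
    const lone = '\u{1F600}'.substring(0, 1);

    const bytes = Buffer.from(lone, 'utf8');
    console.log(bytes);                           // <Buffer ef bf bd>: U+FFFD
    console.log(bytes.toString('utf8') === lone); // false: the data was lost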

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow