Question

As we know, UTF-16 is a variable-length encoding: any character at or above U+10000 takes two 16-bit code units.

However, .NET, Java, and Windows (WCHAR) treat UTF-16 strings as if they were fixed-length. What happens if I use characters at or above U+10000?

And if they do handle characters at or above U+10000, how do they do it? For example, in .NET and Java a char is 16 bits, so a single char cannot hold such a character.

(.NET, Java, and Windows are just examples. My real question is how characters at or above U+10000 are processed in general, but knowing how these platforms do it would help my understanding.)


Thanks to @dystroy, I now know how they are processed. But there is one problem: if a string uses UTF-16 surrogates, a random-access operation such as str[3] becomes an O(N) algorithm, because any character can be either 2 or 4 bytes long! How is this problem handled?

Solution

I answered the first part of the question in this Q&A: basically, some characters are simply spread over more than one Java char.
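
As a small illustration of a character being spread over two Java chars, here is a minimal sketch. The class name SurrogateDemo and the helper fromCodePoint are made up for this example; Character.toChars, String.length and String.codePointCount are standard JDK methods.

```java
final class SurrogateDemo {
    // Hypothetical helper: builds a string that holds exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x00E9);     // U+00E9, inside the BMP
        String astral = fromCodePoint(0x1F600); // U+1F600, outside the BMP

        System.out.println(bmp.length());    // 1: a single char is enough
        System.out.println(astral.length()); // 2: stored as a surrogate pair

        // The pair is a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(astral.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(astral.charAt(1)));  // true

        // Counted as code points, the string still holds one character.
        System.out.println(astral.codePointCount(0, astral.length()));   // 1
    }
}
```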

To answer the second part, about random access to Unicode code points such as str[3], there is more than one method:

  • charAt ignores code points: it simply returns the char (16-bit code unit) at the given index, which is fast
  • codePointAt returns a 32-bit int (but takes a char index, not a code point index)
  • codePointCount counts the code points in a range
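
To make the difference between these three methods concrete, here is a short sketch on a string that contains one character outside the BMP. The class name CodePointAccessDemo and the constant TEXT are made up for this example; the String methods are the standard ones listed above.

```java
final class CodePointAccessDemo {
    // 'a', then U+1F600 (stored as the surrogate pair D83D DE00), then 'b'
    static final String TEXT = "a\uD83D\uDE00b";

    public static void main(String[] args) {
        // length() counts chars (code units), not code points.
        System.out.println(TEXT.length());                            // 4

        // charAt returns half of the surrogate pair.
        System.out.println(Integer.toHexString(TEXT.charAt(1)));      // d83d

        // codePointAt combines the pair when the index is on the high surrogate...
        System.out.println(Integer.toHexString(TEXT.codePointAt(1))); // 1f600

        // ...but returns the lone low surrogate when the index is in the middle.
        System.out.println(Integer.toHexString(TEXT.codePointAt(2))); // de00

        // codePointCount scans the range and counts the pair once.
        System.out.println(TEXT.codePointCount(0, TEXT.length()));    // 3
    }
}
```

Note that codePointAt(2) does not fail: it quietly returns the low surrogate, so the caller is responsible for passing an index that sits on a code point boundary.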

And yes, counting the code points is costly: it is basically O(N). Here's how it's done in Java:

    static int codePointCountImpl(char[] a, int offset, int count) {
        int endIndex = offset + count;
        int n = 0;
        for (int i = offset; i < endIndex; ) {
            n++;
            // A high surrogate followed by a low surrogate counts as one code point.
            if (isHighSurrogate(a[i++])) {
                if (i < endIndex && isLowSurrogate(a[i])) {
                    i++;
                }
            }
        }
        return n;
    }

UTF-16 is a bad format for dealing with code points, especially once you leave the BMP. Most programs simply don't handle code points, which is the reason this format is usable. Most String operations are fast because they don't deal with code points: all standard APIs take char indexes as arguments and don't care what kind of code points lie behind them.

Other suggestions

Usually this problem is not handled at all. Many languages and libraries that use UTF-8 or UTF-16 index and take substrings by code unit, not by code point. That is, str[3] will just return a lone surrogate if index 3 falls inside a surrogate pair. Access is of course constant-time that way, but for anything outside the BMP (or outside ASCII, for UTF-8) you have to be careful what you do.

If you're lucky there are methods to access code points, e.g. String.codePointAt in Java. But to reach the n-th code point you still have to scan the string from the start and determine the code point boundaries.
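
One possible way to do that scan in Java is sketched below. The class NthCodePoint and both helper methods are hypothetical names made up for this example; String.offsetByCodePoints, String.codePointAt and String.codePoints (Java 8+) are standard JDK methods. The second helper shows an alternative trade-off, assuming you can afford the extra memory: pay the O(N) scan once and index an int array in constant time afterwards.

```java
final class NthCodePoint {
    // Hypothetical helper: returns the n-th code point (0-based).
    // offsetByCodePoints walks the string from index 0, so each call is O(N).
    static int codePointAtIndex(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n);
        return s.codePointAt(charIndex);
    }

    // Hypothetical helper: one O(N) conversion to an array of code points;
    // after that, array[i] is a constant-time lookup.
    static int[] toCodePointArray(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'

        System.out.println(Integer.toHexString(codePointAtIndex(s, 1))); // 1f600
        System.out.println((char) codePointAtIndex(s, 2));               // b
        System.out.println(toCodePointArray(s).length);                  // 3
    }
}
```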

Generally, though, even accessing code points doesn't gain you very much, except at the library level. Strings are often ultimately used to interact with the user, and there graphemes or the visual string length matter more than code points. And that requires even more processing.
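
As a rough illustration of that extra processing, the sketch below counts user-perceived characters with java.text.BreakIterator. The class GraphemeDemo and the method countGraphemes are made-up names, and what counts as one unit depends on the JDK's boundary rules (handling of complex emoji sequences, for instance, varies between JDK versions), so treat it as an approximation rather than a definitive grapheme counter.

```java
import java.text.BreakIterator;

final class GraphemeDemo {
    // Hypothetical helper: counts the character boundaries reported by BreakIterator.
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        // Each call to next() advances to the following boundary until DONE.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String accented = "e\u0301"; // 'e' followed by a combining acute accent

        System.out.println(accented.length());                             // 2 chars
        System.out.println(accented.codePointCount(0, accented.length())); // 2 code points
        System.out.println(countGraphemes(accented));                      // 1 visible character
    }
}
```

Here neither the char count nor the code point count matches what the user sees, which is why code point access alone rarely solves user-facing problems.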

Licensed under: CC-BY-SA with attribution