UTF-16 string: how to process code points over U+10000? [duplicate]

Question 1

I answered the first part of the question in this Q&A: basically, some characters are simply spread over more than one Java char (code points above U+FFFF are encoded as a surrogate pair of two chars).
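As a rough illustration of that point, here is a minimal sketch that asks the JDK how a code point outside the BMP is encoded. The class name SurrogatePairDemo and the helper encode are hypothetical names chosen for this example; Character.toChars and Character.charCount are the standard library calls.

```java
class SurrogatePairDemo {
    /** Returns the UTF-16 code units (Java chars) that encode the given code point. */
    static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a code point outside the BMP

        // Number of chars needed to represent this code point.
        System.out.println(Character.charCount(codePoint)); // prints 2

        // The two halves of the surrogate pair.
        for (char unit : encode(codePoint)) {
            System.out.printf("%04X%n", (int) unit); // prints D83D, then DE00
        }
    }
}
```

A code point inside the BMP, such as 'A', comes back as a single char.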

To answer the second part, about random access to Unicode code points such as str[3], there is more than one method:

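charAt is naive: it returns the char (UTF-16 code unit) at a char index, which is fast and obvious but may be only half of a surrogate pair
codePointAt returns the code point as a 32-bit int (but it still needs a char index, not a code point index)
codePointCount counts the code points in a range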
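To make the difference between these three methods concrete, here is a small sketch on a string that contains one character outside the BMP. The class name CodePointAccessDemo and the constant SAMPLE are made up for this example.

```java
class CodePointAccessDemo {
    // "a", then U+1F600 (stored as the surrogate pair D83D DE00), then "b"
    static final String SAMPLE = "a\uD83D\uDE00b";

    public static void main(String[] args) {
        // length() counts chars, not code points.
        System.out.println(SAMPLE.length()); // prints 4

        // charAt returns a single code unit, here the high surrogate.
        System.out.println(Integer.toHexString(SAMPLE.charAt(1))); // prints d83d

        // codePointAt combines the pair when the index points at its first half.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1))); // prints 1f600

        // Pointing at the second half just returns the lone low surrogate.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(2))); // prints de00

        // codePointCount scans the range and counts code points.
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length())); // prints 3
    }
}
```

Note that every index passed here is a char index, which is why codePointAt alone does not give random access by code point.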

And yes, counting the code points is costly: it is basically O(N). Here's how it's done in Java:

    static int codePointCountImpl(char[] a, int offset, int count) {
        int endIndex = offset + count;
        int n = 0;
        for (int i = offset; i < endIndex; ) {
            n++;
            if (isHighSurrogate(a[i++])) {
                if (i < endIndex && isLowSurrogate(a[i])) {
                    i++;
                }
            }
        }
        return n;
    }

UTF-16 is a poor format for working with code points, especially once you leave the BMP. Most programs simply don't handle code points, which is why the format is usable in practice. Most String operations are fast because they don't deal with code points: the standard methods take char indexes as arguments and don't care which code points lie behind them.

Question 2

Usually this problem is not handled at all. Many languages and libraries that use UTF-8 or UTF-16 do substrings or indexing by code unit, not by code point. That is, str[3] will just return a code unit, which for a character outside the BMP is a lone surrogate. Access is of course constant-time in that case, but for anything outside the BMP (in UTF-16) or outside ASCII (in UTF-8) you have to be careful about what you do.

If you're lucky there are methods to access code points, e.g. String.codePointAt in Java. Even then, to reach the n-th code point you have to scan the string from the start and determine the code point boundaries.
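In Java, one possible way to do that scan is to let String.offsetByCodePoints translate a code point index into a char index first. The sketch below assumes that approach; the class NthCodePoint and the method codePointAtIndex are hypothetical names, not part of the JDK.

```java
class NthCodePoint {
    /**
     * Returns the code point at the given code point index (not char index).
     * The lookup is linear in the index, because offsetByCodePoints has to
     * walk the string and skip over surrogate pairs.
     * Throws IndexOutOfBoundsException if the index is out of range.
     */
    static int codePointAtIndex(String s, int codePointIndex) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // "a", U+1F600, "b"
        System.out.println(Integer.toHexString(codePointAtIndex(s, 1))); // prints 1f600
        System.out.println((char) codePointAtIndex(s, 2));               // prints b
    }
}
```

If you need to visit every code point anyway, iterating once from the start is cheaper than calling such a helper repeatedly, since each call rescans the string.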

Generally, though, even accessing code points doesn't gain you very much, except at library level. Strings are often used eventually to interact with the user, and in that case graphemes or the visual string length become more important than code points. That requires even more processing.
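For a rough idea of what that extra processing looks like in Java, here is a sketch that counts user-perceived characters with java.text.BreakIterator. The class GraphemeCount and its count method are hypothetical names, and how closely the result follows the current Unicode grapheme cluster rules may depend on the JDK version, so treat this as an approximation.

```java
import java.text.BreakIterator;

class GraphemeCount {
    /** Counts the character boundaries that BreakIterator reports in s. */
    static int count(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        // Each call to next() advances to the following boundary until DONE.
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (combining acute accent), then "x"
        String s = "e\u0301x";
        System.out.println(s.length());                      // prints 3 (chars)
        System.out.println(s.codePointCount(0, s.length())); // prints 3 (code points)
        System.out.println(count(s));                        // prints 2 (what the user sees)
    }
}
```

Here the char count and the code point count agree, yet neither matches what is displayed, which is why code points alone are not enough for user-facing text.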