I answered the first part of the question in this Q&A: basically, some characters are simply spread over more than one Java char.
To answer the second part, about random access to Unicode code points (str[3]), there is more than one method:
- charAt is oblivious to surrogates and only handles chars, in a fast and obvious way
- codePointAt returns a 32-bit int (but still takes a char index)
- codePointCount counts code points
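
To make the difference concrete, here is a small sketch (the string literal is my own example, not from the question) showing all three methods on a string containing a non-BMP character:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP,
        // so it is stored as a surrogate pair of two Java chars.
        String s = "a\uD834\uDD1Eb";

        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // charAt(1) returns only the high surrogate, not a full character.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true

        // codePointAt(1) returns the whole 32-bit code point...
        System.out.println(s.codePointAt(1));                // 119070 (0x1D11E)
        // ...but it still takes a char index: codePointAt(2) lands on the
        // low surrogate and returns that lone surrogate instead.
        System.out.println(s.codePointAt(2));                // 56606 (0xDD1E)
    }
}
```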
And yes, counting code points is costly: basically O(N). Here's how it's done in Java:
    static int codePointCountImpl(char[] a, int offset, int count) {
        int endIndex = offset + count;
        int n = 0;
        for (int i = offset; i < endIndex; ) {
            n++;
            // A high surrogate followed by a low surrogate is one code point,
            // so consume both chars but count them once.
            if (isHighSurrogate(a[i++])) {
                if (i < endIndex && isLowSurrogate(a[i])) {
                    i++;
                }
            }
        }
        return n;
    }
UTF-16 is a bad format for dealing with code points, especially once you leave the BMP. Most programs simply don't handle code points, which is why the format remains usable. Most String operations are fast precisely because they don't deal with code points: all the standard APIs take char indexes as arguments, without worrying about what kind of code points lie behind them.
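
If you do need to walk code points rather than chars, a sketch using the standard codePoints() stream (Java 8+) and offsetByCodePoints looks like this (again with my own example string):

```java
public class CodePointIteration {
    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";

        // codePoints() yields an IntStream of real code points,
        // transparently joining surrogate pairs.
        s.codePoints()
         .mapToObj(cp -> String.format("U+%04X", cp))
         .forEach(System.out::println); // U+0061, U+1D11E, U+0062

        // Random access by code-point index stays O(N): offsetByCodePoints
        // must scan from the start to step over surrogate pairs.
        int charIndex = s.offsetByCodePoints(0, 2); // char index of 3rd code point
        System.out.println(charIndex);              // 3, not 2
        System.out.println(s.codePointAt(charIndex)); // 98 ('b')
    }
}
```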