NSString uses UTF-16 to store codepoints internally, so those in the range you're looking for (U+1F300 to U+1F6FF) will be stored as a surrogate pair (two 16-bit code units, four bytes in total). Despite its name, characterAtIndex: (and unichar) doesn't know about codepoints and will give you the single 16-bit code unit that it sees at the index you give it (the 55357 you're seeing is 0xD83D, the lead surrogate of the codepoint in UTF-16).
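To make the surrogate arithmetic concrete, here is a minimal sketch in plain C (which also compiles inside an Objective-C file) of how a codepoint above U+FFFF is split into two UTF-16 code units. The function name utf16_encode_supplementary is made up for illustration and is not part of any Apple API.

```c
#include <stdint.h>

/* Hypothetical helper: split a codepoint in U+10000..U+10FFFF into its
 * UTF-16 lead (high) and trail (low) surrogates. */
static void utf16_encode_supplementary(uint32_t cp, uint16_t *lead, uint16_t *trail) {
    uint32_t v = cp - 0x10000;                  /* 20-bit offset into the supplementary planes */
    *lead  = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits    -> 0xD800..0xDBFF */
    *trail = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits -> 0xDC00..0xDFFF */
}
```

For example, U+1F600 splits into 0xD83D (decimal 55357) and 0xDE00, which is why characterAtIndex: reports 55357 for the first half of many emoji.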
To examine the raw codepoints, you'll want to convert the string/characters into UTF-32 (which encodes them directly). To do this, you have a few options:
- Get all the UTF-16 code units that make up the codepoint, and use either the standard surrogate-pair arithmetic or CFStringGetLongCharacterForSurrogatePair to convert the surrogate pairs to UTF-32.
- Use either dataUsingEncoding: or getBytes:maxLength:usedLength:encoding:options:range:remainingRange: to convert the NSString to UTF-32, and interpret the raw bytes as a uint32_t.
- Use a library like ICU.
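The first option can be sketched by hand in plain C, which can be dropped into an Objective-C file as-is. This is only an illustration under a few assumptions: the names utf16_next and in_target_range are made up here, the input is assumed to be a buffer of UTF-16 code units (for instance one filled via getCharacters:range:), and unpaired surrogates are simply returned unchanged rather than reported as errors.

```c
#include <stddef.h>
#include <stdint.h>

static int is_lead(uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
static int is_trail(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: decode one codepoint starting at units[*i] and
 * advance *i past the code units consumed. The caller must ensure *i < len.
 * An unpaired surrogate is returned as-is. */
static uint32_t utf16_next(const uint16_t *units, size_t len, size_t *i) {
    uint16_t u = units[(*i)++];
    if (is_lead(u) && *i < len && is_trail(units[*i])) {
        uint16_t t = units[(*i)++];
        /* Same arithmetic CFStringGetLongCharacterForSurrogatePair performs. */
        return 0x10000 + (((uint32_t)u - 0xD800) << 10) + ((uint32_t)t - 0xDC00);
    }
    return u;
}

/* The range from the question: U+1F300 to U+1F6FF. */
static int in_target_range(uint32_t cp) {
    return cp >= 0x1F300 && cp <= 0x1F6FF;
}
```

Calling utf16_next in a loop while the index is less than the string's length (which NSString counts in UTF-16 code units) visits each codepoint once, and in_target_range can then be applied to the decoded value instead of to the raw unichar.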