How to represent Unicode characters in an API

https://stackoverflow.com/questions/2293709

21-09-2019

Question

This is more an MBCS question than a Unicode question. I need to create an API that returns a list of structs, each of which holds a Unicode character as one of its members. This is in .NET, so you'd think I'd want UTF-16, but for some Asian characters two chars would likely be required. What's the best practice when returning Unicode characters?

1. Use an array of 2 UTF-16 chars, and either test the 1st char to see whether it's a surrogate or keep a count?
2. Ignore the surrogate issue and leave it to the caller to figure out that the actual glyph encoding spans two structs?
3. Use a string instead, so I don't care whether it's one or two chars in length?
4. Use UTF-32?
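
To make the trade-off between the options concrete, here is a minimal sketch of how one code point maps to UTF-16 code units. It is written in Python rather than .NET purely for illustration, and the helper names utf16_units and is_high_surrogate are hypothetical, not part of any library.

```python
def utf16_units(ch):
    """Return the UTF-16 code units (as ints) that encode the text in ch."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    """True if a UTF-16 code unit is the first half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF


bmp_char = "\u4e2d"              # U+4E2D, inside the Basic Multilingual Plane
supplementary = "\U00020000"     # U+20000, outside the BMP

bmp_units = utf16_units(bmp_char)          # one code unit
supp_units = utf16_units(supplementary)    # two code units (a surrogate pair)

# In UTF-32 every code point is exactly 4 bytes, whichever plane it is in.
utf32_sizes = [len(c.encode("utf-32-be")) for c in (bmp_char, supplementary)]
```

Only the second character needs two UTF-16 code units, which is the case the first two options have to handle; UTF-32 and a string sidestep it in different ways.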

What do people normally do for UTF-8? I'm guessing they never deal with individual characters and everything is held in a string (for example, searching for a character in a string is really done by looking for a substring). Maybe it's the C++ programmer in me, but a string seems so heavy-handed.
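
As a rough illustration of the substring approach for UTF-8, the sketch below searches for one character by searching for its encoded byte sequence. It is Python, chosen only for illustration, and the sample text is made up.

```python
# Sample text written with escapes so the example does not depend on file encoding.
text = "na\u00efve caf\u00e9"      # "naive cafe" with i-diaeresis and e-acute
needle = "\u00e9"                  # the single character being searched for

haystack_bytes = text.encode("utf-8")
needle_bytes = needle.encode("utf-8")   # two bytes in UTF-8

# Searching for the character is a plain byte-substring search.
byte_index = haystack_bytes.find(needle_bytes)

# The same search on the decoded string gives a code point index instead.
code_point_index = text.find(needle)
```

The byte offset and the code point offset differ because the earlier non-ASCII character occupies two bytes. A byte-level search is safe here because a complete UTF-8 sequence never matches in the middle of another character's encoding.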

I think I'm going to do #3. What have others done?

Solution

You are right about using strings. In Unicode, even a single user-perceived character might require multiple code points (each of which takes a certain number of bytes depending on the encoding), so you can't really work on anything smaller than a string. Even functions such as isUpper should take a string and operate only on its first element.

The reason a character might require multiple code points is typically combining characters, used for accents and the like.
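
A small sketch of the combining-character case, in Python for illustration only. The split_clusters helper is hypothetical and deliberately simplified: it only attaches combining marks to the preceding base character and does not implement the full grapheme cluster rules of the Unicode standard.

```python
import unicodedata

# One user-perceived character, written as two code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
# Not every base-plus-mark sequence has such a precomposed form.
composed = unicodedata.normalize("NFC", decomposed)


def split_clusters(text):
    """Naively group each combining mark with the character before it."""
    clusters = []
    for cp in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(cp) != 0:
            clusters[-1] += cp
        else:
            clusters.append(cp)
    return clusters


clusters = split_clusters("e\u0301a")
```

Each element of clusters is itself a string, which is why an API that returns "one character" per struct ends up needing a string-typed member.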

See this question in the Unicode FAQ.

Licensed under: CC-BY-SA with attribution
