Question

The Question: "Is supporting only the Unicode BMP sufficient to enable native Chinese / Japanese / Korean speakers to use an application in their native language?"

I'm most concerned with Japanese speakers right now, but I'm also interested in the answer for Chinese people as well. If an application only supported characters on the BMP - would it make the application unusable for Chinese/Japanese speakers (i.e. app did not allow data entry / display of supplemental characters)?

I'm not asking if the BMP is the only thing you would ever need for any kind of application (clearly not - especially for all languages in the entire world). I'm asking for CJK speakers, in a professional context, for a modern kind of ordinary app that deals with general free text entry (including names, places, etc.) - is the BMP generally enough?

Even if only supporting the BMP is not correct - would it be pretty close / "good enough"? Would the lack of supplemental characters in an application only be an occasional minor inconvenience; or would a Japanese speaker, for example, consider the application completely broken? Especially considering that they would always be able to work around the problem by spelling out problematic words with Hiragana/Katakana?

What about Chinese speakers who don't have a fallback option, would the lack of supplemental characters be considered a show-stopping problem?

I'm considering a general professional context here - not social or gaming stuff. As an example, there are a lot of emoticons on the supplemental planes - but I personally would not consider an English app that did not support Unicode emoticon characters to be "broken", at least for most professional use.

The application I'm dealing with right now is written in Java, but I think this question applies more generally. Knowing the answer will also help me (regardless of language) get a better handle on how much effort I'd have to go through with regard to font support.


EDIT

Clarification: by "supports only the BMP" - I intend that the application would handle supplemental characters gracefully.
Unsupported characters (including the BMP surrogate code blocks) would be dealt with similarly to how most applications deal with ASCII control codes and other undesirable characters - filtered/disallowed for data entry and "dealt with" for display if that were necessary (filtered out or replaced with the Unicode replacement character, U+FFFD).
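A minimal Java sketch of the "graceful" handling described above, assuming the policy is to replace rather than reject. The class and method names (BmpFilter, toBmpOnly) are hypothetical; the Character and String calls are standard JDK methods. It replaces every supplementary code point and every unpaired surrogate with U+FFFD and leaves all other BMP characters untouched.

```java
final class BmpFilter {

    private static final char REPLACEMENT = '\uFFFD';

    /**
     * Returns a copy of the input in which each supplementary code point
     * (above U+FFFF) and each unpaired surrogate is replaced by U+FFFD.
     */
    static String toBmpOnly(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate is returned as-is.
            int cp = input.codePointAt(i);
            int width = Character.charCount(cp); // 1 for BMP, 2 for supplementary
            boolean loneSurrogate = width == 1 && Character.isSurrogate((char) cp);
            if (Character.isBmpCodePoint(cp) && !loneSurrogate) {
                out.append((char) cp);
            } else {
                out.append(REPLACEMENT);
            }
            i += width;
        }
        return out.toString();
    }
}
```

For data entry, the same loop could reject the input instead of replacing characters; which is preferable depends on how the users are expected to react to silently altered text.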


Solution

For people who might be looking for an actual answer to the actual question: the application that prompted this question is now in production allowing only characters from the BMP (actually a limited subset).

Multiple international customers are using it in Korean in production, and Japanese is going live soon. Chinese is in planning (I have my doubts that the BMP will be sufficient for that, but we'll see I guess).

It's fine - no reported issues related to unsupported characters.

But that's just anecdotal evidence, really. Just because my customers were fine with it - that doesn't mean yours will be. For context, customers of the app are international companies, hundreds of employees using the application to process hundreds of thousands of their customers.

OTHER TIPS

Unfortunately, CJK support in Unicode is broken. The BMP is not enough to properly support CJK, and worse than that, even if you do implement full support for all Unicode planes, it is still broken.

The basic problem is that they tried to merge characters from all three languages that look kinda similar but are not really the same (this is known as Han unification). The result is that they only look right if you select the correct font to display them. For example, a particular character will only look right to a Chinese person if you render it with a Chinese font, and only look right to a Japanese person if you render it with a Japanese font.

There is no universal font. There is no way to determine which language a character is supposed to be from, so you have to somehow guess which font to use. You can try to examine the system language or some other hack like that. You can't support two languages in the same document unless you have additional metadata. If you get raw Unicode strings without any indication of what language they are in, you are screwed.
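One way to carry the "additional metadata" mentioned above is to store a language tag next to each string and pass it on to whatever renders the text. The sketch below is only an illustration under that assumption: the TaggedText class is hypothetical, and it assumes the output is HTML, where the standard lang attribute lets the renderer pick a language-appropriate font.

```java
import java.util.Locale;

/** Hypothetical value type: a string plus the language it is written in. */
final class TaggedText {
    final String text;
    final Locale locale;

    TaggedText(String text, Locale locale) {
        this.text = text;
        this.locale = locale;
    }

    /** Wraps the text in a span whose lang attribute carries the language tag. */
    String toHtmlSpan() {
        return "<span lang=\"" + locale.toLanguageTag() + "\">"
                + escape(text) + "</span>";
    }

    /** Minimal escaping for element content; not a full HTML sanitizer. */
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }
}
```

With this, the same code point can be emitted once tagged with Locale.JAPANESE and once with Locale.SIMPLIFIED_CHINESE in the same document, and each occurrence can be rendered with a different font.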

It's a total disaster. You need to talk to your clients to figure out their needs and how they indicate to their systems what font to use for broken Unicode characters.

Edit: I should also mention that some characters required for people's names are missing from Unicode. Later revisions are better, but of course you also need updated fonts to take advantage of them.

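The majority of CJK codepoints in common use are defined in the BMP, including the main CJK Unified Ideographs block (U+4E00 to U+9FFF) and Extension A. The rarer ideographs in Extension B and the later extensions are not; they live on the supplementary planes. So if you do not need to support those rarer ideographs, then the BMP is fine, otherwise it is not.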

However, I would consider any implementation that does not recognize and process UTF-16 surrogates, even if it does not handle the Unicode codepoints they represent, to be broken.
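As a rough illustration of what "recognize and process UTF-16 surrogates" can mean in Java, here is a sketch that counts, truncates, and validates strings by code point rather than by char. The class and method names (SurrogateAware, codePointLength, truncate, isWellFormed) are hypothetical; the String and Character methods they call are part of the standard JDK.

```java
final class SurrogateAware {

    /** Length in code points, so a surrogate pair counts as one character. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Truncates to at most maxCodePoints without splitting a surrogate pair. */
    static String truncate(String s, int maxCodePoints) {
        if (maxCodePoints < 0) {
            throw new IllegalArgumentException("maxCodePoints must be >= 0");
        }
        if (codePointLength(s) <= maxCodePoints) {
            return s;
        }
        // offsetByCodePoints converts a code point count into a char index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    /** True if every surrogate in the string is part of a valid high+low pair. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of the pair
                } else {
                    return false; // high surrogate without a following low one
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high one
            }
        }
        return true;
    }
}
```

An application can use helpers like these for length limits and validation even if it later rejects or replaces the supplementary code points themselves.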

Unless you are a font developer or are developing an operating system, you should not care about that; let the OS layer deal with it.

Just implement proper Unicode support in your application and allow the operating system to deal with how the characters are typed and displayed.

If you are using custom fonts in your application, you may be in trouble.

In the end, to answer your question: NO, Unicode is more than just the BMP, and you need to support all of it.

Licensed under: CC-BY-SA with attribution