Question

This is my first attempt at dealing with multiple languages in a program. I would really appreciate it if someone could point me to some study material and explain how to approach this type of issue.

The question is representing a string which has multiple languages. For example, think of a string that has "Hello" in many languages, all comma separated. What I want to do is to separate these words. So my questions are:

  1. Can I use std::string for this or should I use std::wstring?
  2. If I want to tokenize each of the words in the string and put them into a char*, should I use wchar? But some encodings, such as UTF, can be bigger than what wchar can support.
  3. Overall, what is the 'accepted' way of handling this type of case?

Thank you.


Solution

Can I use std::string for this or should I use std::wstring?

Both can be used. If you use std::string, the encoding should be UTF-8, which avoids the embedded null bytes you would get with UTF-16, UCS-2, etc. If you use std::wstring, you can also use encodings that need larger code units to represent individual characters, i.e. UCS-2 and UCS-4 will typically be fine, but strictly speaking this is implementation-dependent: wchar_t is 16 bits wide on Windows and 32 bits wide on most Unix-like platforms. In C++11, there are also std::u16string (good for UTF-16 and UCS-2) and std::u32string (good for UCS-4).

So, which of these types to use depends on which encoding you prefer, not on the number or type of languages you want to represent.

As a rule of thumb, UTF-8 is great for storing large texts, while UCS-4 is best if memory footprint does not matter much but you want character-level iteration and position arithmetic to be convenient and fast. (Example: skipping n characters in a UTF-8 string is an O(n) operation, while it is an O(1) operation in UCS-4.)
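To illustrate the difference in skipping cost, here is a minimal sketch. The function names utf8_offset and utf32_offset are made up for this example, and the UTF-8 version assumes the input is valid UTF-8 (it does no error checking).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte offset of the n-th code point in a UTF-8 string, or s.size() if the
// string holds fewer than n code points. Assumes s is valid UTF-8.
// Every byte up to the result has to be inspected, hence O(n).
std::size_t utf8_offset(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i;  // step past the lead byte
        // step past continuation bytes (bit pattern 10xxxxxx)
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;
        }
        --n;
    }
    return i;
}

// In UTF-32 (UCS-4) every code point is exactly one element, so skipping
// n code points is plain index arithmetic, i.e. O(1).
std::size_t utf32_offset(const std::u32string& s, std::size_t n) {
    return n < s.size() ? n : s.size();
}
```

Note that both functions count code points, not user-perceived characters; a letter followed by a combining accent still counts as two.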

If I want to tokenize each of the words in the string and put them into a char*, should I use wchar? But some encodings, such as UTF, can be bigger than what wchar can support.

I would use the same data type for the words as I would use for the text itself. I.e. words of a std::string text should also be std::string, and words from a std::wstring should be std::wstring.
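For the comma-separated example from the question, a sketch along these lines should work when the text is UTF-8 in a std::string. The helper name split_utf8 is made up for this example. It relies on a property of UTF-8: the bytes of a multi-byte sequence are all 0x80 or above, so an ASCII delimiter such as ',' can never appear inside an encoded character.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits UTF-8 text on an ASCII delimiter. Each word is again a std::string
// holding UTF-8, i.e. the same type and encoding as the input text.
// The delimiter must be ASCII (below 0x80) for the byte-wise search to be safe.
std::vector<std::string> split_utf8(const std::string& text, char delim = ',') {
    std::vector<std::string> words;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = text.find(delim, start);
        if (end == std::string::npos) {
            words.push_back(text.substr(start));  // last (or only) word
            break;
        }
        words.push_back(text.substr(start, end - start));
        start = end + 1;  // continue after the delimiter
    }
    return words;
}
```

Empty fields are kept as empty strings, and whitespace around the words is not trimmed; whether that is what you want depends on your input.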

(If there really is a good reason to switch from a string datatype to a character-pointer datatype, then of course char* is right for std::string and wchar_t* is right for std::wstring. Similarly, for the C++11 types there are char16_t* and char32_t*.)

Overall, what is the 'accepted' way of handling this type of case?

The first question you need to answer for yourself is which encoding you want to use for storage and processing. In highly international settings, only Unicode encodings are truly eligible, but there is still more than one to choose from: UTF-8, UCS-2 and UCS-4 are the most common ones. As described above, which one you choose has implications for memory footprint and processing speed, so think carefully about what types of operations you need to perform. You may need to convert from one encoding to another at certain points in your program for optimal space and time behavior. Once you know which encoding you want to use in each part of the program, choose the data type accordingly.
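As an illustration of converting between encodings, here is a hand-written sketch that decodes UTF-8 into a std::u32string. The function name utf8_to_utf32 is made up for this example. It checks for truncated and malformed sequences, but it does not reject overlong encodings or surrogate code points, so a well-tested library is probably the better choice for production code.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32 (one element per code point).
// Throws std::invalid_argument on a bad lead byte, a truncated sequence,
// or a missing continuation byte. Overlong forms are NOT rejected.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

A typical pattern would be to keep text as UTF-8 for storage and I/O, and decode to UTF-32 only in the parts of the program that need random access by code point.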

Once encoding and data types have been decided, you might also need to look into Unicode normalization. In many languages, the same character (or character/diacritics combination) can be represented by more than one sequence of Unicode code points (esp. when combining characters are used). To deal with these cases properly, you may need to apply Unicode normalizations (such as NFKC) to the strings. Note that there is no built-in support for this in the C++ Standard Library.

Licensed under: CC-BY-SA with attribution