Question

I'd like to have a canonical place to pool information about Unicode support in various languages. Is it a part of the core language? Is it provided in libraries? Is it not available at all? Is there a popular resource for Unicode information in a language? One language per answer, please. Also, if you could make the language a heading, that would make it easier to find.

No correct solution

OTHER TIPS

Perl

Perl has built-in Unicode support, mostly. Sort of. From perldoc:

  • perlunitut - Tutorial on using Unicode in Perl. Largely teaches in absolute terms about what you should and should not do as far as Unicode. Covers basics.
  • perlunifaq - Frequently asked questions about Unicode in Perl.
  • perluniintro - Introduction to Unicode in Perl. Less "preachy" than perlunitut.
  • perlunicode - For when you absolutely have to know everything there is to know about Unicode and Perl.
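A minimal sketch of what that support looks like in practice (assuming a UTF-8 terminal):

    use strict;
    use warnings;
    use utf8;                              # this source file is UTF-8; literals become character strings
    binmode STDOUT, ':encoding(UTF-8)';    # encode characters back to bytes on output

    my $s = "café";
    print length($s), "\n";                # 4 - characters, not bytes
    print uc($s), "\n";                    # CAFÉ - Unicode-aware case mapping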

Python 3k

Python 3k (or 3.0 or 3000) has a new approach to handling text (Unicode) and data:
Text Vs. Data Instead Of Unicode Vs. 8-bit. See also the Unicode HOWTO.
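A small sketch of the text/data split in Python 3 (example values assumed):

    s = "héllo"               # str: a sequence of Unicode code points (text)
    b = s.encode("utf-8")     # bytes: raw data; here the UTF-8 encoding of s
    print(len(s), len(b))     # 5 6  - code points vs. bytes
    print(b.decode("utf-8"))  # héllo - decode turns data back into text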

Java

As with .NET, Java uses UTF-16 internally; from the java.lang.String documentation:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
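A short sketch of the consequence (example values assumed): a supplementary character such as U+1F600 occupies two char positions.

    class SurrogateDemo {
        public static void main(String[] args) {
            String s = "a\uD83D\uDE00";                                 // "a" + U+1F600 as a surrogate pair
            System.out.println(s.length());                             // 3: char code units
            System.out.println(s.codePointCount(0, s.length()));        // 2: Unicode code points
            System.out.println(Character.charCount(s.codePointAt(1)));  // 2: the emoji spans two chars
        }
    }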

HQ9+

The Q command has complete Unicode support in most implementations.

Delphi

Delphi 2009 fully supports Unicode. They've changed the implementation of string to default to a 16-bit Unicode encoding, and most libraries, including third-party ones, support Unicode. See Marco Cantù's Delphi and Unicode.

Prior to Delphi 2009, support for Unicode was limited, but there were WideChar and WideString for storing 16-bit encoded strings. See Unicode in Delphi for more info.

Note that you can still develop bilingual CJKV applications without using Unicode. For example, a Shift JIS-encoded string for Japanese can be stored in a plain AnsiString.

Go

Google's Go programming language has Unicode support built in and works with UTF-8: source code is UTF-8, string values conventionally hold UTF-8-encoded bytes, and the unicode and unicode/utf8 packages operate on runes (code points).
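A minimal sketch (example values assumed): len counts bytes, while range and unicode/utf8 work in runes.

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "héllo"                            // string literals hold UTF-8 bytes
        fmt.Println(len(s))                     // 6: bytes
        fmt.Println(utf8.RuneCountInString(s))  // 5: runes (code points)
        for i, r := range s {                   // range decodes UTF-8 as it iterates
            fmt.Printf("%d: %c (U+%04X)\n", i, r, r)
        }
    }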

Python

Python 2 has the classes str and unicode. str objects store bytes; unicode objects store Unicode characters (UTF-16 code units on narrow builds, UCS-4 on wide builds). Most library functions support both (e.g. os.listdir('.') returns a list of str, while os.listdir(u'.') returns a list of unicode objects). Both have encode and decode methods.

Python 3 essentially renamed unicode to str; the Python 3 counterpart of Python 2's str is the type bytes. bytes has a decode method and str an encode method. Since Python 3.3, str objects internally use one of several fixed-width representations to save memory, but to the programmer they still behave as an abstract sequence of Unicode characters.

Python supports (a short sketch follows this list):

  • encoding/decoding
  • normalization
  • simple case conversion and splitting on whitespace
  • looking up characters by their name
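A minimal sketch of the supported items above, using only the standard library (example strings assumed):

    import unicodedata

    s = "Cafe\u0301"                            # 'e' followed by a combining acute accent

    # encoding/decoding
    data = s.encode("utf-8")
    assert data.decode("utf-8") == s

    # normalization (NFC composes 'e' + accent into a single 'é')
    print(unicodedata.normalize("NFC", s))      # Café  (4 code points instead of 5)

    # simple case conversion and splitting on whitespace
    print("Café au lait".upper().split())       # ['CAFÉ', 'AU', 'LAIT']

    # looking up characters by their name
    print(unicodedata.lookup("SNOWMAN"))        # ☃
    print(unicodedata.name("é"))                # LATIN SMALL LETTER E WITH ACUTE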

Python does not support/has limited support for:

  • collation (limited)
  • special case conversions where there is no 1:1 mapping between lower and upper case characters
  • regular expressions (work is in progress)
  • text segmentation
  • bidirectional text handling

See also: The Truth about Unicode in Python

JavaScript

Looks like before JS 1.3 there was no support for Unicode. As of 1.5, UTF-8, UTF-16 and UCS-2 are all supported. You can use Unicode escape sequences in strings, regexes and identifiers. Source
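A small sketch of those escape sequences (era-appropriate var syntax; values assumed):

    var cafe = "caf\u00E9";              // escape in a string literal
    var re = /caf\u00E9/;                // escape in a regular expression
    var na\u00EFve = true;               // escape in an identifier (same identifier as naïve)
    console.log(re.test(cafe), naïve);   // true true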

.NET (C#, VB.NET, ...)

.NET stores strings internally as a sequence of System.Char objects. One System.Char represents a UTF-16 code unit.

From the MSDN documentation on System.Char:

The .NET Framework uses the Char structure to represent a Unicode character. The Unicode Standard identifies each Unicode character with a unique 21-bit scalar number called a code point, and defines the UTF-16 encoding form that specifies how a code point is encoded into a sequence of one or more 16-bit values. Each 16-bit value ranges from hexadecimal 0x0000 through 0xFFFF and is stored in a Char structure.
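A small C# sketch of the char-as-code-unit model (example values assumed): a character outside the BMP needs two System.Char values.

    using System;
    using System.Globalization;

    class CharDemo
    {
        static void Main()
        {
            string s = "a\U0001F600";                                   // "a" + U+1F600, stored as UTF-16
            Console.WriteLine(s.Length);                                // 3: UTF-16 code units (System.Char values)
            Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 2: text elements
            Console.WriteLine(char.ConvertToUtf32(s, 1));               // 128512 (0x1F600): the full code point
        }
    }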


Tcl

Tcl strings have been sequences of Unicode characters since Tcl 8.1 (1999). Internally, they are morphed dynamically between UTF-8 (strictly, the same Modified UTF-8 as Java, due to the handling of U+0000) and UCS-2 (in host endianness and without a BOM, of course). All external strings (with one exception), including those used to communicate with the OS, are internally Unicode before being transformed into whatever encoding is required by the host (or manually configured on a communications channel). The exception is where data is copied between two communications channels with a common encoding (and a few other restrictions not germane here), in which case a direct copy-free binary transfer is used.

Characters outside the BMP are not currently handled either internally or externally. This is a known issue.
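A small sketch, sticking to BMP characters per the caveat above (example values assumed):

    set s "caf\u00e9"                         ;# strings are sequences of Unicode characters
    puts [string length $s]                   ;# 4 characters
    set bytes [encoding convertto utf-8 $s]   ;# explicit conversion to UTF-8 bytes
    puts [string length $bytes]               ;# 5 bytes
    fconfigure stdout -encoding utf-8         ;# manually configure a channel's encoding
    puts $s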

R6RS Scheme

R6RS requires implementations to support Unicode 5.1. All strings are in 'Unicode format'.

Rust

Rust's strings (std::string::String and &str) are always valid UTF-8 and do not use null terminators; as a result, they cannot be indexed as arrays the way they can be in C/C++. They can be sliced somewhat like in Go using .get (since 1.20), with the caveat that slicing fails if the range ends in the middle of a code point.

Rust also has OsStr/OsString for interacting with the host OS. It's a byte array on Unix (containing any sequence of bytes) and WTF-8 on Windows (a superset of UTF-8 that handles the ill-formed Unicode strings allowed in Windows and JavaScript). &str and String can be freely converted to OsStr or OsString, but converting the other way requires a check: either failing on invalid Unicode or replacing it with the Unicode replacement character. (There are also Path/PathBuf, which are just wrappers around OsStr/OsString.)

There are also the CStr and CString types, which represent null-terminated C strings; like OsStr on Unix, they can contain arbitrary bytes.

Rust doesn't directly support UTF-16, but an OsStr can be converted to and from UCS-2/UTF-16 data on Windows.
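A minimal sketch of byte-based slicing and the OsString round trip (example values assumed):

    use std::ffi::OsString;

    fn main() {
        let s = String::from("héllo");        // always valid UTF-8
        println!("{}", s.len());              // 6: len() counts bytes
        println!("{}", s.chars().count());    // 5: code points

        // Slices use byte ranges and must fall on char boundaries.
        assert_eq!(s.get(0..3), Some("hé"));  // 'h' (1 byte) + 'é' (2 bytes)
        assert_eq!(s.get(0..2), None);        // would cut 'é' in half

        // Conversion to the OS string type is free; coming back is checked.
        let os: OsString = OsString::from(s.clone());
        let back = os.into_string().expect("valid Unicode");
        assert_eq!(back, s);
    }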

Common Lisp (SBCL and CLisp)

According to this, SBCL and CLisp support Unicode.

Objective-C

None built-in, aside from whatever happens to be available as part of the C string library.

However, once you add frameworks…

Foundation (Cocoa and Cocoa Touch) and Core Foundation

NSString and CFString each implement a fully Unicode-based string class (actually several classes, as an implementation detail). The two are “toll-free-bridged” so that the API for one can be used with instances of the other, and vice versa.

For data that doesn't necessarily represent text, there's NSData and CFData. NSString provides methods and CFString provides functions to encode text into data and decode text from data. Core Foundation supports more than a hundred different encodings, including all forms of the UTFs. The encodings are divided into two groups: built-in encodings, which are supported everywhere, and external encodings, which are at least supported on Mac OS X.

NSString provides methods for normalizing to forms D, KD, C, or KC. Each returns a new string.

Both NSString and CFString provide a wide variety of comparison/collation options, controlled by Foundation's and Core Foundation's comparison-option flags. They are not all synonymous; for example, Core Foundation makes literal (strict code-point-based) comparison the default, whereas Foundation defaults to non-literal comparison (allowing different representations of the same accented character to compare equal).
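A small, assumed sketch of normalization and comparison with Foundation (example values assumed; relies on the non-literal default described above):

    #import <Foundation/Foundation.h>

    int main(void) {
        @autoreleasepool {
            NSString *decomposed = @"Cafe\u0301";   // 'e' plus a combining acute accent

            // Normalization: each method returns a new string.
            NSString *nfc = [decomposed precomposedStringWithCanonicalMapping];      // form C
            NSString *nfd = [nfc decomposedStringWithCanonicalMapping];              // form D
            NSLog(@"%lu %lu", (unsigned long)nfc.length, (unsigned long)nfd.length); // 4 5

            // Non-literal comparison (Foundation's default) treats the two forms as equal;
            // pass NSLiteralSearch for strict code-point comparison.
            NSLog(@"%d", [nfc compare:nfd options:0] == NSOrderedSame);              // 1
        }
        return 0;
    }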

Note that Core Foundation does not require Objective-C; indeed, it was created pretty much to provide most of the features of Foundation to Carbon programmers, who used straight C or C++. However, I suspect most modern usage of it is in Cocoa or Cocoa Touch programs, which are all written in Objective-C or Objective-C++.

C/C++

C

Early C has no built-in Unicode support. It uses zero-terminated character arrays (char* or char[]) as strings. A char is specified to be one byte (at least 8 bits).

C95 and C99 specify wcs-functions in addition to the old str-functions (e.g. strlen -> wcslen). These functions take wchar_t* instead of char*. wchar_t stands for "wide character type". The size of wchar_t is compiler-specific and can be as small as 8 bits, though in practice it's usually 16 bits (UTF-16) or 32 bits (UTF-32).

Most C library functions are transparent to UTF-8. For example, if your operating system supports UTF-8 (and UTF-8 is configured as your system's charset), then calling fopen with a UTF-8-encoded string will create a properly named file.
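A minimal sketch, assuming the host is configured with a UTF-8 locale (file name and values assumed):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");                    /* adopt the environment's (e.g. UTF-8) locale */

        const char *narrow = "caf\xc3\xa9";       /* the UTF-8 bytes of "café" */
        printf("%zu\n", strlen(narrow));          /* 5: strlen counts bytes, not characters */

        const wchar_t *wide = L"caf\u00e9";       /* wide string; wchar_t size is implementation-defined */
        printf("%zu\n", wcslen(wide));            /* 4 wide characters on typical implementations */

        /* Byte-oriented APIs pass UTF-8 through untouched: */
        FILE *f = fopen("caf\xc3\xa9.txt", "w");  /* a properly named file on UTF-8 systems */
        if (f) fclose(f);
        return 0;
    }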

C++

The situation in C++ is very similar (std::string -> std::wstring). C++11 added char16_t and char32_t, the corresponding std::u16string and std::u32string types, and u8/u/U string literals, but higher-level Unicode processing in the standard library remains limited.

D

D supports UTF-8, UTF-16, and UTF-32 (char, wchar, and dchar, respectively). A table with all the types can be found in the D documentation.
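A tiny sketch of the three string types (example values assumed); lengths are code units, not characters:

    import std.stdio;

    void main()
    {
        string  s = "héllo";     // UTF-8
        wstring w = "héllo"w;    // UTF-16
        dstring d = "héllo"d;    // UTF-32
        writeln(s.length, " ", w.length, " ", d.length);  // 6 5 5
    }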

PHP

There is already an entire thread on this on SO!

Ruby

The only material I can find for Ruby is pretty old, and not being much of a Rubyist, I'm not sure how accurate it is.

For the record, Ruby does support utf8, but not multibyte. Internally, it usually assumes strings are byte vectors, though there are libraries and tricks you can usually use to make things work.

Found that here.

Ruby 1.9

Ruby 1.9 attaches encodings to strings. Binary strings use the encoding "ASCII-8BIT". While the default encoding is usually UTF-8 on any modern system, you cannot assume that every third-party library function always returns strings in this encoding; they may return strings in any other encoding (e.g. some YAML parsers do this in some situations). If you concatenate two strings of different encodings, you may get an Encoding::CompatibilityError.
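A small sketch of how the attached encodings behave (Ruby 1.9 needs a `# encoding: utf-8` magic comment; later versions default to UTF-8; values assumed):

    s = "café"
    puts s.encoding                              # UTF-8
    b = s.dup.force_encoding("ASCII-8BIT")       # same bytes, reinterpreted as a binary string
    puts b.bytesize                              # 5

    begin
      "résumé" + "\xff".force_encoding("ASCII-8BIT")   # mixing incompatible encodings
    rescue Encoding::CompatibilityError => e
      puts e.class                               # Encoding::CompatibilityError
    end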

Arc

Arc doesn't have any Unicode support. Yet.

Lua

Lua 5.3 has a built-in utf8 library, which handles the UTF-8 encoding. It allows you to convert a series of codepoints to the corresponding byte sequence and the other way around, get the length (the number of codepoints in a string), iterate over the codepoints in a string, and get the byte position of the nth codepoint. It also provides a pattern, to be used by the pattern-matching functions in the string library, that matches one UTF-8 byte sequence.

Lua 5.3 has Unicode code point escape sequences that can be used in string literals (for instance, "\u{61}" for "a"). They translate to UTF-8 byte sequences.

Lua source code can be encoded in UTF-8 or any encoding in which ASCII characters take up one byte. UTF-16 and UTF-32 are not understood by the vanilla Lua interpreter. But strings can contain any encoding, or arbitrary binary data.
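A minimal sketch of the Lua 5.3 utf8 library (example values assumed):

    local s = "caf\u{E9}"                      -- code point escape; stored as UTF-8 bytes
    print(#s)                                  -- 5: # counts bytes
    print(utf8.len(s))                         -- 4: code points
    print(utf8.offset(s, 4))                   -- 4: byte index where the 4th code point starts
    for pos, cp in utf8.codes(s) do            -- iterate over code points
      print(pos, cp)
    end
    print(utf8.char(0x63, 0xE9))               -- "cé": code points back to a byte sequence
    print(s:match(utf8.charpattern))           -- "c": matches one UTF-8 byte sequence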

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow