Which programming languages were designed with Unicode support from the beginning?

https://stackoverflow.com/questions/1416215

06-07-2019

Question

Which widely used programming languages were designed ground-up with Unicode support?

A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one?

Solution

Java was probably the first popular language to have ground-up Unicode support.

OTHER TIPS

Basically all of the .NET languages are Unicode languages, such as C# and VB.NET.

There were many breaking changes in Python 3, among them the switch to Unicode for all text.

So Python wasn't designed ground-up for Unicode, but Python 3 was.
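To make "Unicode for all text" concrete, here is a small sketch of how Python 3 separates text from encoded data. The sample string is only an illustration; the point is that `str` holds code points, `bytes` holds encoded data, and conversion between them is always explicit.

```python
# Python 3: str is a sequence of Unicode code points, bytes is raw encoded data.
text = "日本語"                     # a str literal, no special prefix needed
data = text.encode("utf-8")        # explicit conversion from text to bytes

print(type(text).__name__, len(text))   # str 3
print(type(data).__name__, len(data))   # bytes 9

# Mixing the two raises an error instead of converting silently.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)                          # False
print(data.decode("utf-8") == text)      # True
```

In Python 2 the default string type was a byte string and Unicode text needed a separate `unicode` type, which is the breaking change the answer above refers to.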

I don't know how far this goes in other languages, but a fun thing about C# is that not only is the runtime (the string class etc.) Unicode-aware, but Unicode is also fully supported in source code:

using משליט = System.Object;
using תוצאה = System.Int32;
public class שלום : משליט  {
    public תוצאה בית() {
        int אלף = 0;
        for (int λ = 0; λ < 20; λ++) אלף+=λ;
        return אלף;
    }
}

Google's Go programming language supports Unicode and works with UTF-8.

It really is difficult to design future-proof Unicode support into a programming language right from the beginning.

Java is one of the languages that had this designed into the language specification. However, Unicode support in Java 1.0 differs from that in Java 5 and 6, primarily because of the version of Unicode that the language specification targeted when the language was originally designed. Java attempts to track changes in the Unicode standard with every major release.

Early implementations of the JLS could claim Unicode support primarily because Unicode itself then covered at most 65,536 characters (Java 1.0 supported Unicode 1.1, and Java 1.4 supported Unicode 3.0), which fit the 16-bit storage used for a character. That changed with Unicode 3.1; it's an evolving standard, usually with more characters added in each release. The characters added in 3.1 that lie beyond the 16-bit range are called supplementary characters. Support for supplementary characters was added in Java 5 via JSR-204; Java 5 and 6 support Unicode 4.0.
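The 16-bit limit is easy to see by looking at UTF-16 code units directly. The sketch below uses Python only to do the encoding arithmetic; `utf16_code_units` is a hypothetical helper, not part of any Java or Python API.

```python
def utf16_code_units(ch):
    """Return the 16-bit UTF-16 code units for the given string."""
    raw = ch.encode("utf-16-be")   # big-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp = "\u20ac"           # EURO SIGN, fits in a single 16-bit unit
supp = "\U0001D11E"      # MUSICAL SYMBOL G CLEF, a supplementary character

print([hex(u) for u in utf16_code_units(bmp)])    # ['0x20ac']
print([hex(u) for u in utf16_code_units(supp)])   # ['0xd834', '0xdd1e']
```

A supplementary character needs two code units (a surrogate pair), so in Java it occupies two `char` values, which is why methods that work on `int` code points had to be added alongside the `char`-based ones.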

Therefore, don't be surprised if different programming languages implement Unicode support differently.

On the other hand, PHP(!!) and Ruby did not have Unicode support built in at their inception.

PS: Support for Unicode 5.1 is planned for Java 7.

Java and the .NET languages qualify, as other commenters have pointed out, although Java's strings are UTF-16 rather than UCS or UTF-8. (At the time, it seemed like a sensible idea! Now clearly either UTF-8 or UCS would be better.) And Python 3 is really a different, incompatible language from Python 1.x and 2.x, so it qualifies too.
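For a rough feel of the trade-off, the following sketch compares storage sizes across encodings. It assumes that "UCS" above means a fixed-width 32-bit form (UCS-4/UTF-32); `encoded_sizes` is a hypothetical helper written for this illustration.

```python
def encoded_sizes(s):
    """Bytes needed to store s in each encoding (no byte order mark)."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = [
    ("ASCII", "hello"),
    ("CJK", "日本語"),
    ("supplementary", "\U0001D11E"),
]

for label, s in samples:
    # Print the label, the number of code points, and the size in each encoding.
    print(label, len(s), encoded_sizes(s))
```

For "hello" this gives 5, 10 and 20 bytes; for the single supplementary character all three encodings need 4 bytes. UTF-16 is therefore neither the most compact form for ASCII-heavy text nor fixed-width, which is presumably the complaint being made.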

The Plan 9 languages around 1992 were probably the first to do this: the Plan 9 dialect of C, rc, Alef, mk, ACID, and so on were all Unicode-enabled. They took the very simple approach that anything that wasn't ASCII was an identifier character. See their paper from 1993 on the subject. (This is the project where UTF-8 was invented, which meant they could do this in a pretty compatible way, in particular without plumbing binary-versus-text through all their programs.)
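The "anything that isn't ASCII is an identifier character" rule can be sketched in a few lines. This is not the Plan 9 code, just an illustration in Python with hypothetical helper names. It works on raw bytes because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so the scanner never has to decode anything.

```python
def is_ident_byte(b):
    """ASCII letters, digits, underscore, or any byte outside the ASCII range."""
    return b >= 0x80 or chr(b).isalnum() or b == ord("_")

def identifiers(source):
    """Split UTF-8 source bytes into identifier-like tokens.

    Simplified sketch: it does not distinguish numbers from names.
    """
    tokens, current = [], bytearray()
    for b in source:
        if is_ident_byte(b):
            current.append(b)
        elif current:
            tokens.append(current.decode("utf-8"))
            current.clear()
    if current:
        tokens.append(current.decode("utf-8"))
    return tokens

print(identifiers("λ = αβ + x1;".encode("utf-8")))   # ['λ', 'αβ', 'x1']
```

The design choice is that the tool needs no Unicode character tables at all, at the cost of accepting things like non-ASCII punctuation inside identifiers.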

Other languages that support non-ASCII identifiers include current PHP.

Perl 6 has complete Unicode support from scratch (with the Rakudo Perl 6 compiler being the first implementation).

General overview

Unicode operators

Strings, regular expressions, and grammars all operate on graphemes, even for codepoint combinations that have no composed representation (an artificial codepoint standing in for the composed form is generated on the fly in those cases).
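To show why grapheme-level strings matter, here is a sketch in Python rather than Perl 6. Python strings count code points, so the grouping has to be done by hand. `approx_graphemes` is a hypothetical helper that only attaches combining marks to the preceding base character, which is a rough approximation of the real segmentation rules in UAX #29.

```python
import unicodedata

def approx_graphemes(s):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch       # attach the mark to the previous cluster
        else:
            clusters.append(ch)      # start a new cluster
    return clusters

with_composed = "e\u0301"       # e + COMBINING ACUTE ACCENT, composes to U+00E9
without_composed = "q\u0301"    # q + COMBINING ACUTE ACCENT, no precomposed form

print(len(unicodedata.normalize("NFC", with_composed)))      # 1
print(len(unicodedata.normalize("NFC", without_composed)))   # 2
print(len(approx_graphemes(with_composed + without_composed)))  # 2
```

Normalization alone cannot make the second sequence a single unit, which is the case where the text above says an artificial codepoint is generated.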

A special encoding, "utf8-c8", exists to handle data of unknown encoding: it assumes UTF-8 where possible but creates artificial codepoints for undecodable sequences, allowing them to round-trip if necessary.
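Python has a comparable (but not identical) mechanism in its `surrogateescape` error handler, which may help illustrate the round-trip idea. This is an analogy, not how utf8-c8 is implemented: invalid bytes are mapped to lone surrogate code points on decoding and mapped back on encoding.

```python
raw = b"caf\xc3\xa9 \xff\xfe end"    # valid UTF-8 plus two invalid bytes

# A strict decode rejects the invalid bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape keeps them as placeholder code points instead.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))                   # 'caf\xe9 \udcff\udcfe end'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)               # True
```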

Python 3.x: http://docs.python.org/dev/3.0/whatsnew/3.0.html

A feature that was included in a language when it was first designed is not always the best implementation of that feature.

Languages have changed over time, and many have become bloated with extra features while not necessarily keeping the features they first included up to date.

So I just throw out the idea that you shouldn't necessarily discount languages that have recently added Unicode. They will have the advantage of adding Unicode to an already mature development tool, and getting the chance to do it right the first time.

With that in mind, I want to ensure that Delphi is included here, as one of your answers. Embarcadero added Unicode in their Delphi 2009 version and did a mighty fine job on it. It was enough to finally prompt me to upgrade from the Delphi 4 that I had been using for 10 years.

Java uses characters from the Unicode character set.

Java and the .NET languages.

Licensed under: CC-BY-SA with attribution
