Question

I need to refer to a Unicode character with a URI. The IANA references list multiple schemes and namespaces but mention nothing about identifiers for Unicode characters. Does anyone know whether something like this already exists?

I hoped to find something like

  • unicode://U+0394
  • urn:unicode://0394
  • http://unicode.org/unicode/0394

for the Greek capital letter delta, Δ.

In case anyone wonders, this is for a Semantic Web-like application that uses URIs as identifiers for concepts, including the Unicode characters themselves.


Solution

I’m afraid there is no URL or URN for referring to authoritative information on a Unicode character in general. In the Unicode Standard, information about individual characters is partly in the so-called character database (mostly plain text files in specific formats), partly in the Code Charts (PDF files). Neither of them offers a way to point at an individual character. Moreover, the information there is not exhaustive: important remarks on individual characters are scattered around the standard.

The Decodeunicode site has individually addressable items, such as

http://www.decodeunicode.org/en/u+0394

but its information content varies a lot and is generally very limited. It is not official, and it currently contains Unicode 5.0 only.

The Fileformat.info site is much more systematic, but it, too, is unofficial. It is basically limited to formal properties and data derivable from them, plus comments extracted from the Code Charts, instructions on typing the character in Windows, and information about font support. That is still quite a lot. Example:

http://www.fileformat.info/info/unicode/char/0394/
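
For the use case in the question, the two unofficial sites above can at least be addressed mechanically. Here is a minimal Python sketch, assuming the URL patterns in the two examples keep working; neither site documents them as stable identifiers, and the function names are made up for illustration. Whether the sites care about the case of hex letters is also an assumption.

```python
def code_point_hex(char: str) -> str:
    """Return the code point of one character as at least 4 uppercase hex digits."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return f"{ord(char):04X}"


def decodeunicode_url(char: str) -> str:
    # Pattern copied from the Decodeunicode example above (unofficial, may change).
    return f"http://www.decodeunicode.org/en/u+{code_point_hex(char)}"


def fileformat_url(char: str) -> str:
    # Pattern copied from the Fileformat.info example above (unofficial, may change).
    return f"http://www.fileformat.info/info/unicode/char/{code_point_hex(char)}/"


print(decodeunicode_url("Δ"))
print(fileformat_url("Δ"))
```

These are page addresses on third-party sites, not identifiers assigned by the Unicode Consortium, so they only work as long as the sites keep their current layout.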

OTHER TIPS

[EDIT]: I found this URL, which matches your needs: http://unicode.org/cldr/utility/character.jsp?a=1F40F


Well, there is a URL referencing the authoritative information in the Unicode database, even though (as the other answer says) it does not give all the information on one specific character.

The following URL points to the latest Unicode database. It is a simple list of the valid Unicode characters that currently exist. Some upcoming characters are missing (㋿), and you should expect the file to change over time.

The contents look like the following, which isn't very practical to use as-is.

$ grep -ai kangaroo UnicodeData.txt -C 7
1F991;SQUID;So;0;ON;;;;;N;;;;;
1F992;GIRAFFE FACE;So;0;ON;;;;;N;;;;;
1F993;ZEBRA FACE;So;0;ON;;;;;N;;;;;
1F994;HEDGEHOG;So;0;ON;;;;;N;;;;;
1F995;SAUROPOD;So;0;ON;;;;;N;;;;;
1F996;T-REX;So;0;ON;;;;;N;;;;;
1F997;CRICKET;So;0;ON;;;;;N;;;;;
1F998;KANGAROO;So;0;ON;;;;;N;;;;;
1F999;LLAMA;So;0;ON;;;;;N;;;;;
1F99A;PEACOCK;So;0;ON;;;;;N;;;;;
1F99B;HIPPOPOTAMUS;So;0;ON;;;;;N;;;;;
1F99C;PARROT;So;0;ON;;;;;N;;;;;
1F99D;RACCOON;So;0;ON;;;;;N;;;;;
1F99E;LOBSTER;So;0;ON;;;;;N;;;;;
1F99F;MOSQUITO;So;0;ON;;;;;N;;;;;
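
Each line is a semicolon-separated record, so it is easy to turn into something more usable. Below is a rough Python sketch that parses one line; the field positions (code point, name, General_Category, Bidi_Class) follow my reading of the UnicodeData.txt format, and the `UcdRecord` type is just an illustrative name, not part of any library.

```python
from typing import NamedTuple


class UcdRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    bidi_class: str


def parse_unicode_data_line(line: str) -> UcdRecord:
    """Parse one UnicodeData.txt line into a small record (only 4 of the 15 fields)."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UcdRecord(
        code_point=int(fields[0], 16),  # field 0: hexadecimal code point
        name=fields[1],                 # field 1: character name
        general_category=fields[2],     # field 2: e.g. "So" (Symbol, other)
        bidi_class=fields[4],           # field 4: e.g. "ON" (Other Neutral)
    )


record = parse_unicode_data_line("1F998;KANGAROO;So;0;ON;;;;;N;;;;;")
print(record.name, hex(record.code_point), chr(record.code_point))
```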

You could build up a hacky « hash-based » namespace with a suffix like this, but that's definitely non-standard.
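
To make that idea concrete, here is a small Python sketch of such a fragment-based identifier. The base address is a hypothetical placeholder standing in for wherever the data file lives, and the `#` plus hex code point convention is my own assumption, not anything Unicode defines.

```python
from urllib.parse import urldefrag

# Hypothetical placeholder; substitute the real location of the data file.
BASE = "http://example.org/UnicodeData.txt"


def char_to_uri(char: str, base: str = BASE) -> str:
    """Build a non-standard identifier: base URL plus '#' and the hex code point."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return f"{base}#{ord(char):04X}"


def uri_to_char(uri: str) -> str:
    """Recover the character from the fragment of such an identifier."""
    _, fragment = urldefrag(uri)
    return chr(int(fragment, 16))


uri = char_to_uri("Δ")
print(uri)
print(uri_to_char(uri))
```

Since the fragment does not correspond to any anchor inside the plain text file, a URI like this only serves as a name, which may be enough when URIs are used purely as concept identifiers.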

Licensed under: CC-BY-SA with attribution