full list of all subscripts and diacritical marks in unicode

https://stackoverflow.com/questions/8663133

08-04-2021

Question

Answered: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt is a list of all Unicode characters, and 0xcc99 # U+0319 COMBINING RIGHT TACK BELOW is somewhat like a comma in a monospaced font (example: 10̡9̡8̡7̡6̡5̡4̡3̡2̡1̡0̡ ).

Is there a complete list of all Unicode characters along with their verbal descriptions, e.g. a list of lines like "0xcc99 # U+0319 COMBINING RIGHT TACK BELOW"?

In particular, which diacritical mark do I use to type 1. or 2_o3? The motivation is that I want to be able to add a point or comma in a monospace font in a terminal without actually adding a character.

Solution

There is no complete list of all Unicode characters along with their verbal descriptions, not even a list of them with their Unicode names. The UnicodeData.txt file refers to large ranges of characters generically, e.g.

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;

and

AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
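To make the range convention concrete, here is a minimal sketch that pairs up the First/Last entries shown above and reports how many code points each pair covers. The names SAMPLE, RANGE_NAME and find_ranges are hypothetical, and the parsing assumes only what the four sample lines show: semicolon-separated fields with the code point first and the bracketed label second.

```python
import re

# The four range lines quoted above, in UnicodeData.txt field layout.
SAMPLE = """\
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

# Matches the generic "<Label, First>" / "<Label, Last>" name field.
RANGE_NAME = re.compile(r"<(.+), (First|Last)>")

def find_ranges(lines):
    """Yield (first, last, label) for each First/Last pair of lines."""
    pending = None  # (code point, label) of a First entry awaiting its Last
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue  # blank or malformed line
        match = RANGE_NAME.fullmatch(fields[1])
        if not match:
            continue  # an ordinary single-character entry
        code, label, kind = int(fields[0], 16), match.group(1), match.group(2)
        if kind == "First":
            pending = (code, label)
        elif pending and pending[1] == label:
            yield pending[0], code, label
            pending = None

for first, last, label in find_ranges(SAMPLE.splitlines()):
    print(f"U+{first:04X}..U+{last:04X} {label}: {last - first + 1} code points")
```

Every code point inside such a pair has no line of its own in the file, which is why the file by itself is not a character-by-character list.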

It would be possible to construct a complete list with Unicode names, but what would be the purpose? The Unicode names, such as COMBINING PALATALIZED HOOK BELOW, are identifiers, not descriptions. Taken as English texts, some of them are intuitively descriptive, some are very vague, some are obscure, and some are outright wrong—and will never be changed, due to the stability principle. The principle is largely necessitated by the use of Unicode names in programs; they must not be changed, for the same reasons why the Unicode numbers must not be changed.
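For a limited range, a list in the format the question asks for can be generated from the name data that ships with a programming language. The following is a rough sketch in Python, assuming the standard unicodedata module; describe is a hypothetical helper, and the names it returns depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def describe(code_point):
    """Format one code point as '0x<UTF-8 bytes> # U+XXXX NAME'."""
    char = chr(code_point)
    utf8_hex = char.encode("utf-8").hex()
    # Fall back to a placeholder if this Python's Unicode data has no name.
    name = unicodedata.name(char, "<no name available>")
    return f"0x{utf8_hex} # U+{code_point:04X} {name}"

# U+0300..U+036F is the Combining Diacritical Marks block.
for cp in range(0x0300, 0x0370):
    print(describe(cp))
```

The output is still only a list of identifiers, so the caveats about names as descriptions apply to it unchanged.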

Some of the Unicode names for diacritics, too, are misleading or at least incomplete. The shape of a diacritic cannot be inferred from the Unicode name alone, and the shape may even vary a lot (e.g., t with caron is ť in lowercase, with the diacritic looking like a comma, whereas the corresponding uppercase letter Ť has... well, a caron-like caron).

Using characters like U+0319 and U+0321 in your text data implies that you will need a relatively extensive font and relatively advanced rendering software that displays combining diacritic marks well. Moreover, if you intend to use them in meanings and contexts they were not intended for (they are meant for use in phonetic notations, where they are associated with letters to indicate features of pronunciation), you may actually need poor software that implements them improperly (considering the intended use and rendering). For example, U+0319 is supposed to appear below a letter

OTHER TIPS

Yes, it's on the CD that comes with TUS (The Unicode Standard), or downloadable from unicode.org: the Unicode Character Database.

"my application is as follows: sometimes I work in command line in xterm with programs that output long numbers I find hard to read. So I want to use diacritics to add dots or commas so that 2938485860 becomes 2.938.485.860 and formatting is preserved. U+0321 is not really good for that...."

If you want to add periods to numbers inline, there's a way to do it. In Unicode, there's the "Enclosed Alphanumerics" block, which includes digits with trailing periods.

2.938.485.860 -> ⒉93⒏48⒌860
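A minimal sketch of that substitution in Python follows. The helper names dotted and dot_group are made up for illustration. One assumption to flag: the Enclosed Alphanumerics block has "digit plus full stop" characters only for 1 to 9 (U+2488 to U+2490), so the sketch falls back to U+1F100 DIGIT ZERO FULL STOP from the Enclosed Alphanumeric Supplement for zero, which fonts may support less often.

```python
def dotted(digit):
    """Return the single 'digit + full stop' character for an ASCII digit."""
    if digit == "0":
        # Assumption: zero comes from a different block than 1-9.
        return "\U0001F100"  # DIGIT ZERO FULL STOP
    return chr(0x2487 + int(digit))  # U+2488 is DIGIT ONE FULL STOP

def dot_group(digits, group=3):
    """Swap the last digit of every group except the rightmost one
    for its dotted form, so the character count stays the same."""
    out = []
    for i, d in enumerate(digits):
        remaining = len(digits) - i - 1  # digits to the right of this one
        out.append(dotted(d) if remaining and remaining % group == 0 else d)
    return "".join(out)

print(dot_group("2938485860"))
```

The result has exactly as many characters as the input, which is what keeps column alignment intact when the font draws each character one cell wide.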

Note that in a terminal, these may be unreadable. You could alternatively try

2⑨38④85⑧60 - using circled numbers on every third digit (ugly too)
2̲9384̲8̲5̲860 - using underlined characters
2𝟵38𝟰85𝟴60 - changing some digits to a "MATHEMATICAL SANS-SERIF BOLD DIGIT"
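As one illustration of these alternatives, the underlined variant can be sketched by appending U+0332 COMBINING LOW LINE to every digit of alternating groups. The function names below are hypothetical, and base_width is only a rough proxy for terminal columns: it assumes the terminal gives combining marks zero width, which not every terminal or font does.

```python
import unicodedata

COMBINING_LOW_LINE = "\u0332"

def underline_alternate_groups(digits, group=3):
    """Underline every other group of digits, starting with the leftmost."""
    first = len(digits) % group or group  # size of the leftmost group
    groups = [digits[:first]]
    groups += [digits[i:i + group] for i in range(first, len(digits), group)]
    out = []
    for index, chunk in enumerate(groups):
        if index % 2 == 0:
            # Attach the combining mark after each digit in this group.
            chunk = "".join(ch + COMBINING_LOW_LINE for ch in chunk)
        out.append(chunk)
    return "".join(out)

def base_width(text):
    """Count the characters that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

marked = underline_alternate_groups("2938485860")
print(marked, base_width(marked))
```

Unlike the enclosed digits, this keeps the original digit characters in the text, so copying the number out of the terminal and stripping combining marks recovers the plain value.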

Licensed under: CC-BY-SA with attribution
