Question

I'm writing a piece of software in C++ that must work correctly with UTF-16. However, since UTF-16 behaves like a fixed-width encoding for most text (which it is not), I'm wondering where I could find strings to test that it works correctly.

It's mostly useless to test it with Latin letters, or even the accented letters of my country, so I'm not sure what kind of characters I should use for testing.

NOTE: the piece of software is a C++ library and I'd like to use UTF-16 for both its API and its internal storage.

Any suggestions are welcome!

Solution

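Code points U+0000 to U+FFFF (the Basic Multilingual Plane) are encoded in UTF-16 as a single 16-bit code unit, with no surrogate pairs. Any character from http://www.unicode.org/charts/ above that range needs a surrogate pair and will do.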
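To make the boundary concrete, here is a minimal sketch of how a single code point maps to UTF-16 code units. The function name encode_utf16 is hypothetical, and returning an empty string for unencodable values is an assumption made for this example, not something the library in the question is known to do.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumption for this sketch: values that cannot be encoded (surrogate
// code points, or anything above U+10FFFF) yield an empty string.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // not a valid Unicode scalar value
    }
    if (cp < 0x10000) {
        // U+0000..U+FFFF: a single code unit, no surrogates needed.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // U+10000..U+10FFFF: subtract 0x10000 and split the remaining
        // 20 bits into a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For example, encode_utf16(0x41) gives one code unit, while encode_utf16(0x1F600) gives the two code units 0xD83D 0xDE00.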

http://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt lists the code point ranges of the Unicode blocks. The blocks above U+FFFF include:

10000..1007F; Linear B Syllabary
10080..100FF; Linear B Ideograms
10100..1013F; Aegean Numbers
10140..1018F; Ancient Greek Numbers
10190..101CF; Ancient Symbols
101D0..101FF; Phaistos Disc
10280..1029F; Lycian
102A0..102DF; Carian
10300..1032F; Old Italic
10330..1034F; Gothic
10380..1039F; Ugaritic
103A0..103DF; Old Persian
10400..1044F; Deseret
10450..1047F; Shavian
10480..104AF; Osmanya
10800..1083F; Cypriot Syllabary
10840..1085F; Imperial Aramaic
10900..1091F; Phoenician
10920..1093F; Lydian
10980..1099F; Meroitic Hieroglyphs
109A0..109FF; Meroitic Cursive
10A00..10A5F; Kharoshthi
10A60..10A7F; Old South Arabian
10B00..10B3F; Avestan
10B40..10B5F; Inscriptional Parthian
10B60..10B7F; Inscriptional Pahlavi
10C00..10C4F; Old Turkic
10E60..10E7F; Rumi Numeral Symbols
11000..1107F; Brahmi
11080..110CF; Kaithi
110D0..110FF; Sora Sompeng
11100..1114F; Chakma
11180..111DF; Sharada
11680..116CF; Takri
12000..123FF; Cuneiform
12400..1247F; Cuneiform Numbers and Punctuation
13000..1342F; Egyptian Hieroglyphs
16800..16A3F; Bamum Supplement
16F00..16F9F; Miao
1B000..1B0FF; Kana Supplement
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D200..1D24F; Ancient Greek Musical Notation
1D300..1D35F; Tai Xuan Jing Symbols
1D360..1D37F; Counting Rod Numerals
1D400..1D7FF; Mathematical Alphanumeric Symbols
1EE00..1EEFF; Arabic Mathematical Alphabetic Symbols
1F000..1F02F; Mahjong Tiles
1F030..1F09F; Domino Tiles
1F0A0..1F0FF; Playing Cards
1F100..1F1FF; Enclosed Alphanumeric Supplement
1F200..1F2FF; Enclosed Ideographic Supplement
1F300..1F5FF; Miscellaneous Symbols And Pictographs
1F600..1F64F; Emoticons
1F680..1F6FF; Transport And Map Symbols
1F700..1F77F; Alchemical Symbols
20000..2A6DF; CJK Unified Ideographs Extension B
2A700..2B73F; CJK Unified Ideographs Extension C
2B740..2B81F; CJK Unified Ideographs Extension D
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
E0100..E01EF; Variation Selectors Supplement

Take your pick!
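As one way to turn a few of these blocks into test input, the sketch below builds a small fixture from C++11 u"" literals with \U escapes, so the source file's own encoding does not matter. The struct Utf16Sample and the function utf16_samples are hypothetical names, and the choice of characters is arbitrary.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One test case: a UTF-16 string and the number of code points it holds.
struct Utf16Sample {
    const char* block;        // block the non-BMP character is taken from
    std::u16string text;      // test input, stored as UTF-16 code units
    std::size_t code_points;  // expected number of code points
};

// Hypothetical fixture mixing BMP characters with characters from the
// blocks listed above. Every non-BMP character occupies two code units,
// so text.size() is larger than code_points in each sample.
std::vector<Utf16Sample> utf16_samples() {
    return {
        {"Linear B Syllabary", u"a\U00010000b", 3},
        {"Gothic", u"\U00010330\U00010331", 2},
        {"Musical Symbols", u"\U0001D11E", 1},
        {"Emoticons", u"x\U0001F600", 2},
        {"CJK Unified Ideographs Extension B", u"\U00020000\u4E2D", 2},
    };
}
```

A test can then check that the library reports code_points characters for each sample rather than text.size().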

Also, if the text you find is in some other encoding (like UTF-8), you can use a program like iconv to convert it to UTF-16.

OTHER TIPS

Process the text of this Wikipedia page. It has plenty of Cuneiform mixed with Latin-alphabet text.

Any character with a code point of U+10000 or above (a non-BMP character) is fine, e.g. text with emoji in it 😊. This is because only non-BMP characters are encoded as a surrogate pair, i.e. two UTF-16 code units.
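A decoder is a convenient reference to test against. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, and replacing an unpaired surrogate with U+FFFD is an assumption (a library might equally choose to report an error).

```cpp
#include <cstddef>
#include <string>

// High (leading) surrogates are U+D800..U+DBFF.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Low (trailing) surrogates are U+DC00..U+DFFF.
bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Decode UTF-16 code units into code points.
// Assumption for this sketch: an unpaired surrogate becomes U+FFFD.
std::u32string decode_utf16(const std::u16string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (is_high_surrogate(u) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            // Combine the two 10-bit halves and add back 0x10000.
            const char32_t hi = u - 0xD800;
            const char32_t lo = s[i + 1] - 0xDC00;
            out.push_back(0x10000 + (hi << 10) + lo);
            ++i;  // the low surrogate has been consumed as well
        } else if (is_high_surrogate(u) || is_low_surrogate(u)) {
            out.push_back(0xFFFD);  // unpaired surrogate
        } else {
            out.push_back(u);  // BMP character: one code unit
        }
    }
    return out;
}
```

With this, decode_utf16(u"a\U0001F60A") yields two code points from three code units, which is the kind of mismatch a fixed-width assumption would get wrong.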

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow