Are non-latin numerals in Windows SBCS codepages used by any Microsoft libraries to represent numerical data in C strings?
18-04-2021
Question
I'm trying to write a parser for "text" files which I know will be encoded in one of the Windows single byte code pages. These files contain text representations of basic data types, and the spec I have for these representations is lacking, to say the least.
I noticed in Windows-874 ten little inconspicuous characters near the end called THAI DIGIT ZERO to THAI DIGIT NINE.
I'm trying to write this parser to be pretty robust but I'm working a bit in the dark as there are many different programs which can generate these data files and I don't have access to the sources.
What I want to know is: do any functions in Microsoft C++ libraries convert real number data types into a std::string or char const * (i.e. serialization) which could contain non-Arabic numerals?
I don't use Microsoft C++ libraries, so I can't reference any in particular, but a made-up example could be char const * IntegerFunctions::ToString(int i).
Solution
These digits certainly could be created by Microsoft libraries. The properties LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS determine whether numbers formatted by the OS will use native (i.e. non-ASCII) digits. Those strings are initially Unicode, because that's how Windows creates strings internally. When you have a Thai locale and you convert Unicode to CP874, those characters will be kept.
A simple function that demonstrates this behavior is GetNumberFormatA.
OTHER TIPS
Sort of the inverse answer, but this page seems to indicate that Microsoft's runtime libraries understand quite a few (but not all) non-Latin numerals when doing what you want to do, i.e. parsing a string into a number.
Thai is included, which seems to indicate that it's a good idea to support it in custom code, too.
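One way to support Thai digits in custom code is to normalize them to ASCII before handing the text to an ordinary number parser. The following is only a minimal sketch: the function name normalizeCp874Digits is made up for illustration, and it assumes the input bytes really are Windows-874, where THAI DIGIT ZERO to THAI DIGIT NINE occupy the bytes 0xF0 to 0xF9.

```cpp
#include <string>

// Hypothetical helper (not part of any Microsoft library).
// Assumes `in` holds Windows-874 encoded bytes. In that code page the
// bytes 0xF0..0xF9 are THAI DIGIT ZERO..THAI DIGIT NINE; each one is
// replaced by the matching ASCII digit. All other bytes are copied as-is.
std::string normalizeCp874Digits(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        // Work on the unsigned value so bytes >= 0x80 compare correctly
        // even where plain char is signed.
        const unsigned char b = static_cast<unsigned char>(c);
        if (b >= 0xF0 && b <= 0xF9) {
            out.push_back(static_cast<char>('0' + (b - 0xF0)));
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

The normalized string can then be passed to whatever ASCII-only parsing the rest of the parser already uses (std::stoi, std::strtod and so on). This only makes sense once the file is known to be CP874; in other single byte code pages the same byte values mean something else.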
To include more information here, the linked-to page states that Microsoft's msvcr100 runtime supports decoding numerals from the following character sets:
- ASCII
- Arabic-Indic
- Extended Arabic
- Devanagari
- Bengali
- Gurmukhi
- Gujarati
- Oriya
- Telugu
- Kannada
- Malayalam
- Thai
- Lao
- Tibetan
- Myanmar
- Khmer
- Mongolian
- Full Width
The full page includes more programming environments and more languages (there are plenty of negatives, too).
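If the parser works on Unicode code points after decoding, the same list can be covered with a small lookup. This is a sketch, not how msvcr100 does it: digitValue and kDigitZeros are hypothetical names, and the table simply holds the code point of digit zero for each script listed above, relying on each script's digits 0 to 9 being contiguous in Unicode.

```cpp
// Code point of DIGIT ZERO for each script in the list above.
// Illustrative table; the names here are made up for this sketch.
static const char32_t kDigitZeros[] = {
    0x0030,  // ASCII
    0x0660,  // Arabic-Indic
    0x06F0,  // Extended Arabic-Indic
    0x0966,  // Devanagari
    0x09E6,  // Bengali
    0x0A66,  // Gurmukhi
    0x0AE6,  // Gujarati
    0x0B66,  // Oriya
    0x0C66,  // Telugu
    0x0CE6,  // Kannada
    0x0D66,  // Malayalam
    0x0E50,  // Thai
    0x0ED0,  // Lao
    0x0F20,  // Tibetan
    0x1040,  // Myanmar
    0x17E0,  // Khmer
    0x1810,  // Mongolian
    0xFF10   // Fullwidth
};

// Returns 0..9 if `cp` is a decimal digit in one of the scripts above,
// or -1 otherwise.
int digitValue(char32_t cp) {
    for (char32_t zero : kDigitZeros) {
        if (cp >= zero && cp <= zero + 9) {
            return static_cast<int>(cp - zero);
        }
    }
    return -1;
}
```

A linear scan over 18 entries is cheap enough for a file parser; a sorted table with binary search would be an obvious alternative if profiling ever showed it mattered.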