Are non-latin numerals in Windows SBCS codepages used by any Microsoft libraries to represent numerical data in C strings?

https://stackoverflow.com/questions/8943018

18-04-2021

Question

I'm trying to write a parser for "text" files which I know will be encoded in one of the Windows single byte code pages. These files contain text representations of basic data types, and the spec I have for these representations is lacking, to say the least.

I noticed in Windows-874 ten little inconspicuous characters near the end called THAI DIGIT ZERO to THAI DIGIT NINE.

I'm trying to write this parser to be pretty robust but I'm working a bit in the dark as there are many different programs which can generate these data files and I don't have access to the sources.

What I want to know is: do any functions in Microsoft C++ libraries convert real number data types into a std::string or char const * (i.e. serialization) that could contain non-Arabic numerals?

I don't use Microsoft C++ libraries so can't reference any in particular but a made-up example could be char const * IntegerFunctions::ToString(int i).

Solution

These digits certainly could be created by Microsoft libraries. The locale properties LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS determine whether numbers formatted by the OS will use native (i.e. non-ASCII) digits. The formatted strings are initially Unicode, because that is how Windows creates strings internally. When you have a Thai locale and convert that Unicode text to CP874, the Thai digit characters are kept.

A simple function that demonstrates this behavior is GetNumberFormatA.

OTHER TIPS

Sort of the inverse answer, but this page seems to indicate that Microsoft's runtime libraries at least understand quite a few (but not all) non-Latin numerals when doing what you want to do, i.e. parsing a string into a number.

Thai is included, which seems to indicate that it's a good idea to support it in custom code, too.
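
For a hand-written parser of Windows-874 data, one way to act on this is to fold the Thai digit bytes into ASCII digits before handing the text to an ordinary number parser. The following is only a minimal sketch: the helper names are made up for illustration, and it assumes that Windows-874 encodes THAI DIGIT ZERO to THAI DIGIT NINE as the consecutive bytes 0xF0 to 0xF9.

```cpp
#include <string>

// Hypothetical helper: returns 0..9 if the CP874 byte is an ASCII or Thai
// digit, and -1 otherwise.
// Assumption: Thai digits occupy bytes 0xF0..0xF9 in Windows-874.
int digit_value_cp874(unsigned char byte) {
    if (byte >= '0' && byte <= '9') return byte - '0';
    if (byte >= 0xF0 && byte <= 0xF9) return byte - 0xF0;
    return -1;
}

// Hypothetical helper: copies the input, rewriting Thai digit bytes as ASCII
// digits so that routines such as std::stol or std::stod can parse the result.
// All other bytes (signs, separators, letters) are left untouched.
std::string normalize_digits_cp874(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (unsigned char byte : text) {
        if (byte >= 0xF0 && byte <= 0xF9)
            out.push_back(static_cast<char>('0' + (byte - 0xF0)));
        else
            out.push_back(static_cast<char>(byte));
    }
    return out;
}
```

After normalization, a call such as std::stod(normalize_digits_cp874(field)) treats Thai and ASCII digits alike. Locale-specific decimal or grouping separators are not handled here and would need their own rules.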

To include more information here, the linked-to page states that Microsoft's msvcr100 runtime supports decoding numerals from the following character sets:

ASCII
Arabic-Indic
Extended Arabic
Devanagari
Bengali
Gurmukhi
Gujarati
Oriya
Telugu
Kannada
Malayalam
Thai
Lao
Tibetan
Myanmar
Khmer
Mongolian
Full Width

The full page includes more programming environments and more languages (there are plenty of negatives, too).
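
If the text is decoded to Unicode code points first, custom code can accept digits from the scripts listed above with a small table. This is a sketch under stated assumptions, not a description of how msvcr100 does it: the function and table names are hypothetical, and it relies on Unicode assigning each script's ten decimal digits to consecutive code points starting at the zero digit.

```cpp
// Code point of DIGIT ZERO for each script in the list above.
static const char32_t kDigitZeros[] = {
    0x0030, // ASCII
    0x0660, // Arabic-Indic
    0x06F0, // Extended Arabic-Indic
    0x0966, // Devanagari
    0x09E6, // Bengali
    0x0A66, // Gurmukhi
    0x0AE6, // Gujarati
    0x0B66, // Oriya
    0x0C66, // Telugu
    0x0CE6, // Kannada
    0x0D66, // Malayalam
    0x0E50, // Thai
    0x0ED0, // Lao
    0x0F20, // Tibetan
    0x1040, // Myanmar
    0x17E0, // Khmer
    0x1810, // Mongolian
    0xFF10, // Fullwidth
};

// Hypothetical helper: returns 0..9 if cp is a decimal digit in one of the
// scripts above, and -1 otherwise.
int unicode_digit_value(char32_t cp) {
    for (char32_t zero : kDigitZeros) {
        if (cp >= zero && cp <= zero + 9)
            return static_cast<int>(cp - zero);
    }
    return -1;
}
```

A linear scan over eighteen entries is cheap enough for a parser. Whether digits from different scripts may be mixed within one number is a policy decision this sketch leaves open.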

Licensed under: CC-BY-SA with attribution
