Are non-latin numerals in Windows SBCS codepages used by any Microsoft libraries to represent numerical data in C strings?

StackOverflow https://stackoverflow.com/questions/8943018

Question

I'm trying to write a parser for "text" files which I know will be encoded in one of the Windows single byte code pages. These files contain text representations of basic data types, and the spec I have for these representations is lacking, to say the least.

I noticed in Windows-874 ten little inconspicuous characters near the end called THAI DIGIT ZERO to THAI DIGIT NINE.

I'm trying to write this parser to be pretty robust but I'm working a bit in the dark as there are many different programs which can generate these data files and I don't have access to the sources.

What I want to know is: do any functions in Microsoft C++ libraries convert real number data types into a std::string or char const * (i.e. serialization) which would contain non-arabic-numerals?

I don't use Microsoft C++ libraries so can't reference any in particular but a made-up example could be char const * IntegerFunctions::ToString(int i).


Solution

These digits certainly could be created by Microsoft libraries. The locale properties LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS determine whether numbers formatted by the OS will use native (i.e. non-ASCII) digits. Those digits are initially Unicode, because that's how Windows creates strings internally. When you have a Thai locale and the Unicode string is converted to CP874, those characters are kept, since CP874 has code points for them.

A simple function that demonstrates this behavior is GetNumberFormatA.

OTHER TIPS

Sort of the inverse answer, but this page seems to indicate that Microsoft's runtime libraries at least understand quite a few (but not all) non-Latin numerals when doing what you want to do, i.e. parsing a string into a number.

Thai is included, which seems to indicate that it's a good idea to support it in custom code, too.
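As a rough illustration of what supporting Thai digits in custom code could look like, here is a minimal C++17 sketch. It relies on the Windows-874 layout, where THAI DIGIT ZERO to THAI DIGIT NINE sit at bytes 0xF0 to 0xF9. The function names cp874_digit_value and parse_cp874_unsigned are made up for this example and do not come from any Microsoft library; the sketch only handles unsigned integers with no sign, separators, or whitespace.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: map one Windows-874 byte to its decimal digit value.
// ASCII digits are 0x30-0x39; THAI DIGIT ZERO..NINE are 0xF0-0xF9 in CP874.
std::optional<int> cp874_digit_value(unsigned char byte) {
    if (byte >= 0x30 && byte <= 0x39) return byte - 0x30;
    if (byte >= 0xF0 && byte <= 0xF9) return byte - 0xF0;
    return std::nullopt;
}

// Hypothetical helper: parse an unsigned decimal integer from CP874 bytes.
// Returns nullopt for empty input, any non-digit byte, or overflow.
std::optional<std::uint64_t> parse_cp874_unsigned(std::string_view text) {
    if (text.empty()) return std::nullopt;
    std::uint64_t value = 0;
    for (unsigned char byte : text) {
        const std::optional<int> digit = cp874_digit_value(byte);
        if (!digit) return std::nullopt;
        const std::uint64_t d = static_cast<std::uint64_t>(*digit);
        // Reject input where value * 10 + d would exceed UINT64_MAX.
        if (value > (UINT64_MAX - d) / 10) return std::nullopt;
        value = value * 10 + d;
    }
    return value;
}
```

This version accepts ASCII and Thai digits mixed within one number. Whether that is desirable depends on the files being parsed; a stricter parser could require that all digits in a token come from the same set.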

To include more information here, the linked-to page states that Microsoft's msvcr100 runtime supports decoding numerals from the following character sets:

  • ASCII
  • Arabic-Indic
  • Extended Arabic
  • Devanagari
  • Bengali
  • Gurmukhi
  • Gujarati
  • Oriya
  • Telugu
  • Kannada
  • Malayalam
  • Thai
  • Lao
  • Tibetan
  • Myanmar
  • Khmer
  • Mongolian
  • Full Width

The full page includes more programming environments and more languages (there are plenty of negatives, too).

Licensed under: CC-BY-SA with attribution