Question

I want to retrieve the Unicode representation, in hex, of characters. For example, for the character €, the value should be 0x0080. I only need this for ISO 8859-1, the first 256 characters of Unicode. So I cast to unsigned char in C++, as follows:

(unsigned char) normal_character

Here, normal_character is of type char. This has worked so far, but are there any caveats I should be careful of?

Thanks!

EDIT:

I took the character € as an example. It is not in the ISO 8859-1 charset.


Solution

The ISO-8859-1 encoding is, by definition, identical to the first 256 code points of Unicode, so a simple numeric cast is enough. Note, however, that a general Unicode code point needs a 32-bit type to hold it (strictly only 21 bits, but a uint21_t type does not usually exist):

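#include <cstdint>  // uint32_t
char ch_iso88591 = 'a';
// Cast to unsigned char first so bytes 0x80-0xFF are not sign-extended.
uint32_t ch_unicode = (uint32_t)(unsigned char)ch_iso88591;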

And as you correctly noted in your question, you have to cast to unsigned char first, because char may be a signed type.
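To make the caveat concrete, here is a minimal sketch of the cast next to the mistake it guards against. The function names latin1_to_codepoint, naive_widen and check_latin1_to_codepoint are made up for this illustration, and the result of naive_widen depends on whether char is signed on your platform.

```cpp
#include <cassert>
#include <cstdint>

// Widen one ISO-8859-1 byte to its Unicode code point.
// Going through unsigned char keeps bytes 0x80-0xFF from sign-extending.
inline std::uint32_t latin1_to_codepoint(char ch) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(ch));
}

// The pitfall: widening a plain char directly. Where char is signed,
// a byte such as 0xE9 becomes 0xFFFFFFE9 instead of 0xE9.
inline std::uint32_t naive_widen(char ch) {
    return static_cast<std::uint32_t>(ch);
}

// Hypothetical self-check, for illustration only.
inline void check_latin1_to_codepoint() {
    const char e_acute = static_cast<char>(0xE9);  // 'é' in ISO-8859-1
    assert(latin1_to_codepoint('a') == 0x61u);
    assert(latin1_to_codepoint(e_acute) == 0xE9u);  // U+00E9
}
```

For characters in the ASCII range both functions agree, which is why a missing unsigned char cast can go unnoticed until the first accented character shows up.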

If the original charset were anything other than ISO-8859-1 (or ASCII, of course), you would need a lookup table. For example, Windows-1252 is often confused with ISO-8859-1, but the two are somewhat different (see your € example). If you have Windows-1252, then you do need a table. The table is quite simple to build: you can copy the values yourself from the Wikipedia article (only the values from 0x80 to 0xFF are needed, because the 0x00-0x7F range is exactly the same).
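As a rough sketch of such a table: Windows-1252 in fact differs from ISO-8859-1 only in the 0x80-0x9F block, so 32 entries are enough and everything else can fall through to the plain cast. The names kCp1252High, cp1252_to_codepoint and check_cp1252 are invented for this example, and mapping the five unassigned bytes to U+FFFD is an assumption of the sketch, not something Windows-1252 itself specifies.

```cpp
#include <cassert>
#include <cstdint>

// Windows-1252 bytes 0x80-0x9F mapped to Unicode code points.
// The five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to
// U+FFFD here; that choice is an assumption of this sketch.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

// Convert one Windows-1252 byte to its Unicode code point.
inline std::uint32_t cp1252_to_codepoint(char ch) {
    const unsigned char byte = static_cast<unsigned char>(ch);
    if (byte >= 0x80 && byte <= 0x9F) {
        return kCp1252High[byte - 0x80];
    }
    return byte;  // 0x00-0x7F and 0xA0-0xFF match ISO-8859-1
}

// Hypothetical self-check, for illustration only.
inline void check_cp1252() {
    assert(cp1252_to_codepoint(static_cast<char>(0x80)) == 0x20ACu);  // euro sign
    assert(cp1252_to_codepoint('A') == 0x41u);
}
```

With this table, the byte 0x80 from the question comes out as U+20AC rather than U+0080.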

Other answers

ISO 8859-1 does not support the character € (Unicode codepoint U+20AC) at all: there is no mapping defined in ISO 8859-1 for that codepoint. ISO 8859-1 does not define any character for the byte 0x80, either (most ISO 8859 charsets do not). That codepoint does map to the byte 0x80 in a few other charsets, such as Windows-1252, but not in all of them. For example, it maps to 0xA4 in ISO 8859-7:2003 and ISO 8859-15. So it is not enough to simply truncate the codepoint value to 8 bits. You have to map it properly.
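The difference between truncating and mapping can be sketched for the euro sign alone. The Charset enum and the functions encode_euro, truncate_codepoint and check_encode_euro are hypothetical names for this illustration, and returning -1 for "not representable" is just one possible convention.

```cpp
#include <cassert>
#include <cstdint>

enum class Charset { Iso8859_1, Iso8859_15, Windows1252 };

// Byte that encodes the euro sign (U+20AC) in the given charset,
// or -1 if the charset cannot represent it.
inline int encode_euro(Charset cs) {
    switch (cs) {
        case Charset::Windows1252: return 0x80;
        case Charset::Iso8859_15:  return 0xA4;
        case Charset::Iso8859_1:   return -1;  // no mapping exists
    }
    return -1;
}

// What plain truncation does instead: keep only the low 8 bits.
inline std::uint8_t truncate_codepoint(std::uint32_t cp) {
    return static_cast<std::uint8_t>(cp & 0xFFu);
}

// Hypothetical self-check, for illustration only.
inline void check_encode_euro() {
    assert(encode_euro(Charset::Windows1252) == 0x80);
    assert(truncate_codepoint(0x20ACu) == 0xAC);  // not the euro in any of these charsets
}
```

Truncating U+20AC leaves the byte 0xAC, which in ISO 8859-1 is the not sign (¬), so the result is a different character rather than an error you would notice.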

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow