Question

I'm a little confused about the storage needed for representing an Arabic character.

Please let me know if this is true:

  • in ISO/IEC 8859-6 encoding it takes 2 bytes (http://en.wikipedia.org/wiki/ISO/IEC_8859-6)
  • in UNICODE it takes 4 bytes (http://en.wikipedia.org/wiki/Arabic_Unicode)

What are the advantages of each encoding? When should we prefer one over another one?


Solution

Well first, Unicode is not an encoding. It is a standard for assigning code points to every character in every language. These code points are integers; how many bytes they take up depends on the specific encoding. The most common Unicode encodings are UTF-8 and UTF-16.
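To make the distinction concrete, here is a minimal Python sketch (any language with explicit encoders would do) showing that the code point is a fixed integer while the number of bytes depends entirely on the encoding you pick:

    ch = "A"                       # U+0041 LATIN CAPITAL LETTER A
    print(hex(ord(ch)))            # 0x41 -- the code point, just an integer
    print(ch.encode("utf-8"))      # b'A'             -> 1 byte
    print(ch.encode("utf-16-le"))  # b'A\x00'         -> 2 bytes
    print(ch.encode("utf-32-le"))  # b'A\x00\x00\x00' -> 4 bytes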

To summarise:

  • ISO 8859-6 uses 1 byte for each Arabic character, but doesn't support "Arabic presentation forms", nor characters from any script other than basic Latin (the ASCII range).
  • UTF-8 uses 2 bytes for each Arabic character, and 3 bytes for "Arabic presentation forms".
  • UTF-16 uses 2 bytes for each Arabic character, including "Arabic presentation forms".

I will use two examples: 'ح' (U+062D, an ordinary Arabic letter) and 'ﻰ' (U+FEF0, a presentation form). The U+ numbers are the hexadecimal Unicode code points of those characters.
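You can verify those code points yourself; here is a small Python sketch using the standard unicodedata module:

    import unicodedata

    for ch in ("\u062D", "\uFEF0"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+062D ARABIC LETTER HAH
    # U+FEF0 ARABIC LETTER ALEF MAKSURA FINAL FORM

Note that the second one is explicitly a "final form", i.e. a presentation form rather than an ordinary letter.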

In ISO 8859-6, most Arabic characters take up just a single byte, since that encoding is dedicated to Arabic. For example, the character 'ح' (U+062D) is encoded as the single byte "CD", as you can see from the table in the Wikipedia article. The character 'ﻰ' (U+FEF0) is an "Arabic Presentation Form" and doesn't appear in ISO 8859-6 at all: you simply cannot encode this character in that encoding.
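Python's built-in "iso-8859-6" codec demonstrates both points (a quick sketch; the try/except is just to show the failure):

    print("\u062D".encode("iso-8859-6"))  # b'\xcd' -- one byte, as in the table
    try:
        "\uFEF0".encode("iso-8859-6")
    except UnicodeEncodeError as err:
        print(err)  # U+FEF0 has no slot in ISO 8859-6, so encoding fails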

There are two very common Unicode encodings which let you encode all characters: UTF-8 and UTF-16. They have different strengths. UTF-8 uses 1 byte for ASCII characters, 2 or 3 bytes for the other characters in the Basic Multilingual Plane (including all of Arabic), and 4 bytes for everything beyond it. UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for the rest. So if your text is mostly ASCII, UTF-8 is smaller; for text dominated by scripts that need 3 bytes in UTF-8 (such as CJK), UTF-16 is smaller. For Arabic itself the two are equal, at 2 bytes per letter.
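To see the trade-off concretely, compare the encoded sizes of a mostly-ASCII string and a purely Arabic one (the sample strings here are my own, just for illustration):

    english = "Hello, world!"
    arabic = "\u0645\u0631\u062D\u0628\u0627"  # the word "مرحبا"
    for text in (english, arabic):
        print(len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    # 13 26  -- ASCII text: UTF-8 uses half the space
    # 10 10  -- Arabic text: both use 2 bytes per letter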

In UTF-8, 'ح' (U+062D) is encoded as the 2-byte sequence "D8 AD", while 'ﻰ' (U+FEF0) is encoded as the 3-byte sequence "EF BB B0". The rule is that code points from U+0080 to U+07FF take 2 bytes, and code points from U+0800 to U+FFFF take 3 bytes. So all the basic Arabic and Arabic Supplement characters use 2 bytes, whereas the Arabic Presentation Forms use 3 bytes.
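This is easy to check with Python's bytes.hex() (the separator argument needs Python 3.8 or later):

    print("\u062D".encode("utf-8").hex(" "))  # d8 ad    -- 2 bytes
    print("\uFEF0".encode("utf-8").hex(" "))  # ef bb b0 -- 3 bytes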

In UTF-16, 'ح' (U+062D) is encoded as the 2-byte sequence "2D 06", while 'ﻰ' (U+FEF0) is encoded as the 2-byte sequence "F0 FE"; all Arabic characters, presentation forms included, are two bytes. For characters like these, the UTF-16 bytes are just the code point itself, so the only complication is endianness: "2D 06" and "F0 FE" are the little-endian (UTF-16LE) forms, while the equally valid big-endian (UTF-16BE) forms are "06 2D" and "FE F0". A byte order mark (BOM) at the start of the text is commonly used to signal which byte order was chosen.
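Both byte orders, and the BOM that Python's plain "utf-16" codec prepends, are easy to observe (a sketch; the output on the last line is what you'd see on a little-endian machine):

    ch = "\u062D"
    print(ch.encode("utf-16-le").hex(" "))  # 2d 06 -- little-endian (UTF-16LE)
    print(ch.encode("utf-16-be").hex(" "))  # 06 2d -- big-endian (UTF-16BE)
    print(ch.encode("utf-16").hex(" "))     # ff fe 2d 06 -- BOM, then native order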

In summary, I would usually recommend UTF-8, as it has no byte-order ambiguity and handles ASCII text very efficiently. Arabic characters are 2 bytes in either encoding (unless you use "presentation forms"). You can use ISO 8859-6 if you are only ever using ASCII and Arabic characters, and that will save some space, but it usually isn't worth it: it will break as soon as any other character comes along, while UTF-8 and UTF-16 support every character in Unicode.

OTHER TIPS

There are several different Unicode encodings; the amount of space used depends on which one you're using: http://unicode.org/faq/utf_bom.html

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow