How do pack and unpack guess the character encoding when converting to and from UTF-8?

https://stackoverflow.com/questions/11456213

20-06-2021

Question

Suppose I want to convert "\xBD" to UTF-8.

If I use pack & unpack, I'll get ½:

puts "\xBD".unpack('C*').pack('U*')    #=> ½

as "\xBD" is ½ in ISO-8859-1.

BUT "\xBD" is œ in ISO-8859-9.

My question is: why did pack use ISO-8859-1 instead of ISO-8859-9 to convert the character to UTF-8? Is there a way to configure that character encoding?

I know I can use Iconv in Ruby 1.8.7, and String#encode in 1.9.2, but I'm curious about pack because I use it in some code.

Solution

This actually has nothing to do with how \xBD is represented in ISO-8859-x. The critical part is the pack into UTF-8.

The pack receives [189]. In Unicode (UTF-8 is just one way of serializing Unicode), code point 189, i.e. U+00BD, is ½. Don't think of this as pack "preferring" ISO-8859-1 over ISO-8859-9; pack never looks at any ISO-8859-x table. The first 256 Unicode code points were deliberately assigned to match ISO-8859-1, so turning raw byte values directly into code points will always look like an ISO-8859-1 conversion.
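To see that overlap in practice, here is a small sketch that compares the unpack/pack route with an explicit ISO-8859-1 conversion. It assumes Ruby 1.9 or later (for force_encoding and String#encode), and I limited it to the printable upper half, 0xA0 to 0xFF, purely as my own choice for the illustration.

```ruby
# For each byte value in the upper half of ISO-8859-1, compare:
#   1. treating the byte value as a Unicode code point (pack 'U')
#   2. transcoding the raw byte from ISO-8859-1 to UTF-8
same = (0xA0..0xFF).all? do |byte|
  via_pack   = [byte].pack('U')
  via_encode = [byte].pack('C').force_encoding('ISO-8859-1').encode('UTF-8')
  via_pack == via_encode
end

puts same  # true: both routes agree for every value in the range
```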

Since you're trying to learn more about pack/unpack, let me explain more:

When you unpack with the C directive, Ruby ignores the string's encoding and returns each byte as an 8-bit unsigned integer. In this case \xBD becomes 0xBD, a.k.a. 189. This is a really basic conversion.
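A quick sketch of what C gives you, assuming Ruby 1.9 or later. The variable names are mine; the point is that you get bytes, not characters, and relabeling the encoding changes nothing.

```ruby
raw       = "\xBD"      # a single raw byte
utf8_half = "\u00BD"    # the character U+00BD stored as UTF-8 (two bytes)

raw_bytes  = raw.unpack('C*')        # [189]
half_bytes = utf8_half.unpack('C*')  # [194, 189]

# Changing the encoding label does not change what 'C' sees.
relabeled_bytes = raw.dup.force_encoding('ISO-8859-1').unpack('C*')  # [189]

p raw_bytes, half_bytes, relabeled_bytes
```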

When you pack with the U directive, Ruby treats each integer in the array as a Unicode code point and writes out that code point's UTF-8 byte sequence. There is no lookup in any legacy encoding table, and no source encoding is consulted.
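And the other direction, again as a rough sketch for Ruby 1.9 or later. The extra code points (0x153 and 0x20AC) are examples I picked to show that U is not limited to values that fit in one byte.

```ruby
half = [189].pack('U*')

p half.encoding.name     # "UTF-8"
p half.bytes.to_a        # [194, 189]
p half == "\u00BD"       # true

# Code points above 255 work the same way.
oe   = [0x153].pack('U')   # U+0153, the oe ligature
euro = [0x20AC].pack('U')  # U+20AC, the euro sign

p oe == "\u0153"         # true
p euro.bytes.to_a        # [226, 130, 172]
```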

pack/unpack have very specific behavior depending on the directives you provide. I suggest reading up on them at ruby-doc.org. Some of the directives still don't make sense to me, so don't be discouraged.
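As for the "can I configure it" part of the question: as far as I know the U directive has no option for a source encoding, so if the bytes are in some other encoding, String#encode (mentioned in the question) is the tool for that. A minimal sketch, assuming Ruby 1.9 or later; ISO-8859-15 is my own pick as an example of an encoding where byte 0xBD is œ.

```ruby
byte = [0xBD].pack('C')   # one raw byte, tagged ASCII-8BIT

# Tell Ruby which encoding the byte is really in, then transcode to UTF-8.
as_latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')
as_latin9 = byte.dup.force_encoding('ISO-8859-15').encode('UTF-8')

p as_latin1 == "\u00BD"   # true (one half)
p as_latin9 == "\u0153"   # true (oe ligature)

# Equivalent form: pass the source encoding as the second argument.
p byte.encode('UTF-8', 'ISO-8859-15') == as_latin9   # true
```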

Licensed under: CC-BY-SA with attribution
