How does ruby represent strings internally?

https://stackoverflow.com/questions/11225362

17-06-2021
|

Question

I ran into some trouble while creating a C-Extension for ruby that got me thinking. I wonder how Ruby (1.9.1) handles strings (and all the encoding-stuff) internally?

If I have a string like "o", and I pass the string to a C-Function (as VALUE), I can deal with it pretty easily using the RSTRING_PTR() and the RSTRING_LEN() macro. However, if I make the string ö (a german umlaut character), RSTRING_LEN() will give me 2.

I'm a bit stumped on the contents of RSTRING_PTR() in that case, the two bytes are 0xA4 and 0xC3. What encoding is this? I tried using "ö".force_encoding( ... ) with different encodings before passing the string to the C-function, but that does not affect the contents of RSTRING_PTR at all.

What I need is a way to have the string represented as a WCHAR* encoded in UTF-16 (in the case of "ö", that would be 0x00F6) in my C-function, but that's kinda hard to do if you do not know what encoding you're coming from...

thx for any help in advance

Solution

String internals in ruby 1.9 depends on __ENCODING__ constant and Encoding.default_internal setting.

In your case it looks like UTF-8 (default), but ö is actually c3 b6 in UTF-8, and c3 a4 is ä

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow