Question

I've created a WebAPI that returns JSON.

The initial data is as follows (UTF-8 encoded):

@text="Rosenborg har ikke h\xC3\xB8rt hva Steffen"

After calling .to_json on my object, this is what the API sends (I think it is ISO-8859-1 encoded):

"text":"Rosenborg har ikke h\ufffd\ufffdrt hva Steffen"

I'm using HTTParty on the client side, and this is what I finally get:

"text":"Rosenborg har ikke h��rt hva"

Both WebAPI and client app are using Ruby 1.9.2 and Rails 3.

I'm a bit lost with this encoding issue... I tried adding the UTF-8 encoding header to my Ruby files, but it didn't change anything. I guess I'm missing an encoding or decoding step somewhere. Does anyone have an idea?

Thank you very much! Vincent

Solution

Ruby 1.9 makes string encoding explicit. However, Rails may or may not be configured to send responses in the encoding you expect, so you'll have to set the global default:

Encoding.default_external = "utf-8"

I believe the encoding Ruby uses by default for serialization is the platform default. On a US English Windows install that would most likely be Windows-1252 (code page 1252). Other locales get a different encoding.
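As a rough illustration of the setting above, this sketch prints the defaults Ruby picked up from the environment and then overrides them. It assumes plain Ruby 1.9 or later. Setting default_internal as well is an assumption here, not something the answer asks for, and where the lines belong in a Rails app is left open.

```ruby
# What Ruby derived from the locale; the values differ per machine.
puts "external before: #{Encoding.default_external}"
puts "internal before: #{Encoding.default_internal.inspect}"

# Override the defaults so external data is treated as UTF-8.
# Assumption: default_internal is set too; the answer only mentions
# default_external.
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

puts "external after:  #{Encoding.default_external}"
```

Running this before and after the change is a quick way to check whether the platform default is what produced the unexpected bytes.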

Edit: Also see this URL if the JSON is built from data stored in MySQL: https://rails.lighthouseapp.com/projects/8994/tickets/5210-encoding-problem-in-json-format-response

Edit 2: Rails core and its suite of libraries (ActiveRecord, et al.) will respect the Encoding.default_external setting, which determines the encoding of all the values it sends. Unfortunately, because encoding is a relatively new concept in Ruby, not every third-party library has been adjusted to handle it properly, and the ones that have may require additional configuration of their own. This includes MySQL and the RSolr library you were using.

In all versions of Ruby before the 1.9 series, a string was just an array of bytes. When you've been thinking like that for so long, it's hard to wrap your head around the concept of multiple string encodings. The thing that is even more confusing now is that unlike Java, C#, and other languages that use some form of UTF as the native string format, Ruby allows each string to be encoded differently. In retrospect, that might be a mistake, but at least now they are respecting encoding.

String#force_encoding relabels the existing byte sequence with a new encoding but does not change any of the underlying data, so it is possible to end up with invalid byte sequences. Another method, String#encode, transcodes the bytes from one encoding to another. It raises an error on bytes it cannot convert unless you ask it to replace them, so its result is a valid byte sequence. For more information read this:

http://blog.grayproductions.net/articles/ruby_19s_string
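The difference between relabeling and transcoding can be sketched with the bytes from the question. This is a minimal illustration using only core String and Encoding methods. The sample strings and variable names are made up for the example, and the transcoding target (UTF-16LE) is chosen only to force a real conversion.

```ruby
# The bytes from the question, tagged as raw binary to start with.
bytes = "h\xC3\xB8rt".dup.force_encoding(Encoding::BINARY)

# force_encoding only changes the label; the 5 bytes stay the same.
utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
puts utf8.valid_encoding?   # the bytes form valid UTF-8
puts utf8.length            # counted as characters
puts utf8.bytesize          # counted as bytes

# The wrong label is accepted too: each byte now counts as one character.
latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1)
puts latin1.length

# encode rewrites the bytes: the two-byte UTF-8 sequence becomes one byte.
transcoded = utf8.encode(Encoding::ISO_8859_1)
puts transcoded.bytes.to_a.inspect

# A Latin-1 byte mislabeled as UTF-8 is accepted by force_encoding...
broken = "h\xF8rt".dup.force_encoding(Encoding::UTF_8)
puts broken.valid_encoding?

# ...but transcoding it raises unless replacement is requested.
error = begin
  broken.encode(Encoding::UTF_16LE)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e
end
puts error.class

# With invalid: :replace the bad byte becomes U+FFFD, the same
# replacement character that shows up in the question's JSON.
replaced = broken.encode(Encoding::UTF_16LE, invalid: :replace)
                 .encode(Encoding::UTF_8)
puts replaced.valid_encoding?
```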

OTHER TIPS

Ok, I finally found out what the problem is...

I'm using RSolr to get my data from Solr, and unfortunately the default encoding of all result strings is 'US-ASCII', as mentioned here (and checked by myself): http://groups.google.com/group/rsolr/browse_thread/thread/2d4890fa7737e7ef#

So you need to force the encoding as follows:

my_string.force_encoding(Encoding::UTF_8)

Maybe there is a nice encoding option that can be passed to RSolr!

Licensed under: CC-BY-SA with attribution