Question

(Sorry if a newb question...I've done quite a bit of research, honestly...)

I'm writing some Ruby on Rails code to parse RSS/ATOM feeds. My code is throwing up on a pesky '£' symbol.

I've been trying the approach of normalizing the description and title fields of the feeds before doing anything else:

descr = self.description.mb_chars.normalize(:kc)

However, when it hits the string with the '£', I'm guessing that mb_chars hits a problem and returns a regular Ruby String object. I get the error:

undefined method `normalize' for #<String:0x5ef8490>

So what is the best process to defensively prep these strings for insertion into the database? (I need to do a bunch of string processing on them as well)

My problem is compounded in that I don't know the format of the feed I'm processing. For instance, I've had some luck with the following line:

descr = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv descr

However, when it encounters the '£' it simply truncates everything after that point.

When I display the '£' symbol with String#inspect, it displays as '\243'. Failing a method to 'correctly' deal with this symbol, I'd be happy enough to replace it with another value (like 'GBP'). So help with that code would be appreciated as well.
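For reference, the fallback substitution could look roughly like the sketch below. It assumes Ruby 1.9 or later, and it assumes the stray byte really is 0xA3 (octal \243), which is the pound sign in ISO-8859-1 and Windows-1252. The helper name replace_pound_byte is made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9+ and a Latin-1 style single-byte pound sign.
# replace_pound_byte is a hypothetical helper, not a library method.
POUND_BYTE = [0xA3].pack('C') # one raw byte, tagged as binary (ASCII-8BIT)

def replace_pound_byte(str, replacement = 'GBP')
  # Work on a binary copy so Ruby does not complain about invalid UTF-8.
  bytes = str.dup.force_encoding('BINARY')
  # Swap every raw 0xA3 byte for the ASCII replacement, then relabel as UTF-8.
  bytes.gsub(POUND_BYTE, replacement).force_encoding('UTF-8')
end
```

This is only safe when the input is a single-byte encoding such as ISO-8859-1. In real UTF-8 the byte 0xA3 also appears inside multi-byte characters (including '£' itself), so a blind byte swap would corrupt those.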

The feed in question is http://www.dailymail.co.uk/sport/football/index.rss


Solution 2

I was missing something pretty basic - I was guessing at the encoding of the feed that was coming in.

So now I'm looking at (a) the charset in the HTTP response headers, then (b) the encoding in the XML declaration in the feed itself.

Once I have the encoding I use iconv to move it into UTF-8.

So far so good.
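The two lookups described above (HTTP header first, then the XML declaration) could be sketched roughly as follows. This is not the original poster's code: it assumes Ruby 1.9 or later and uses String#encode in place of iconv, and the helper names pick_charset and to_utf8 are made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9.3+ (String#encode, String#byteslice).
# pick_charset and to_utf8 are hypothetical helper names.

# Choose a charset: HTTP Content-Type header first, then the XML declaration.
def pick_charset(content_type, body, default = 'UTF-8')
  if content_type && content_type =~ /charset\s*=\s*["']?([^\s;"']+)/i
    return $1
  end
  # Only inspect the start of the body, where the XML declaration lives.
  # Treat it as binary so bytes that are invalid UTF-8 cannot break the match.
  head = body.byteslice(0, 200).force_encoding('BINARY')
  if head =~ /<\?xml[^>]*encoding\s*=\s*["']([^"']+)["']/i
    return $1
  end
  default
end

# Relabel the raw bytes with the detected charset, then transcode to UTF-8.
# Bytes that cannot be converted are replaced with '?' instead of raising.
def to_utf8(raw, charset)
  raw.dup.force_encoding(charset).encode(
    'UTF-8', :invalid => :replace, :undef => :replace, :replace => '?'
  )
end
```

Usage would be along the lines of to_utf8(body, pick_charset(response_content_type, body)), where body and response_content_type stand for whatever your HTTP client returns. Note that force_encoding raises ArgumentError for a charset name Ruby does not recognise, so a real feed parser would want to rescue that and fall back to a default.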

OTHER TIPS

I've found one solution:

To fix it, I had to define the $KCODE (encoding) for the document:

require 'rubygems'
require 'active_support/all'

$KCODE = 'UTF8'

str = "test ščž"
puts str.parameterize.inspect
puts str.parameterize.to_s

# => test-scz

Original post: https://rails.lighthouseapp.com/projects/8994/tickets/3504-string-parameterize-normalize-bug

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow