Question

I am working with text coming from this website with windows-1252 charset. Converting the text to UTF-8 was done using force_encoding, but the text still contains whitespace that I can't get rid of. The whitespace cannot be removed using text.gsub!(/\s/, ' ') or a similar technique.

The iconv gem doesn't do the trick either - as explained here. It is clear that the whitespace is a remnant of the original text and the windows-1252 charset as I get a invalid multibyte char (US-ASCII) warning if I don't specify the encoding as UTF-8.

I'm not an expert of text encoding so I may be overlooking something trivial.

Update: This is the script that I currently use.

#!/bin/env ruby
# encoding: utf-8

require 'rubygems'
require 'nokogiri'
require 'open-uri'

URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
html = Nokogiri.HTML(open(URL))

# Extract Paragraphs
text = ''
html.css('p').each do |p|
  text += p.text
end

# Clean Up Text
text.gsub!(/\s+/, ' ')

puts text

This is a sample of the text that contains invisible characters that I try to remove. The space before the number 16 is what I am referring to.

cobraron aliento para conversar con él.   16 Al punto corrió la voz, y se divulgó generalmente esta noticia en el palacio del rey: Han

Was it helpful?

Solution

Without seeing your code, it's hard to know exactly what's going on for you. I'll point out, however, that String#force_encoding doesn't transcode the String; it's a way of saying, "No, really, this is UTF-8", for example. To transcode from one encoding to another, use String#encode.

This seems to work for me:

require 'net/http'
s = Net::HTTP.get('www.eximsystems.com', '/LaVerdad/Antiguo/Gn/Genesis.htm')
s.force_encoding('windows-1252')
s.encode!('utf-8')

In general, /[[:space:]]/ should capture more kinds of whitespace that /\s/ (which is equivalent to /[ \t\r\n\f]/), but it doesn't appear to be necessary in this case. I can't find any abnormal whitespace in s at this point. If you're still having problems, you'll need to post your code and a more precise description of the issue.

Update: Thanks for updating your question with your code and an example of the problem. It looks like the issue is non-breaking spaces. I think it's simplest to get rid of them at the source:

require 'nokogiri'
require 'open-uri'

URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
s = open(URL).read            # Separate these three lines to convert  
s.gsub!(' ', ' ')        #  to normal ' ' in source rather than after
html = Nokogiri.HTML(s)       #  conversion to unicode non-breaking space

# Extract Paragraphs
text = ''
html.css('p').each do |p|
  text += p.text
end

# Clean Up Text
text.gsub!(/\s+/, ' ')

puts text

There's now just a single, normal space between the period at the end of 15 and the number 16:

15) Besó también José a todos sus hermanos, orando sobre cada uno de ellos; después de cuyas demostraciones cobraron aliento para conversar con él. 16 Al punto corrió la voz, y se divulgó generalmente esta noticia en el palacio del rey: Han venido los hermanos de José; y holgóse de ello Faraón y toda su corte.

OTHER TIPS

You can try to use text.strip for removing the whitespaces.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top