Without seeing your code, it's hard to know exactly what's going on for you. I'll point out, however, that String#force_encoding doesn't transcode the String; it's a way of saying, "No, really, this is UTF-8", for example. To transcode from one encoding to another, use String#encode.
This seems to work for me:
require 'net/http'
s = Net::HTTP.get('www.eximsystems.com', '/LaVerdad/Antiguo/Gn/Genesis.htm')
s.force_encoding('windows-1252')
s.encode!('utf-8')
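To make the difference between the two methods concrete, here is a minimal sketch. The sample string is made up for illustration (it is not taken from the page above); it just holds one byte that means "é" in Windows-1252 but is not valid UTF-8 on its own.

```ruby
# Hypothetical sample: "Jos" followed by the single byte 0xE9,
# which is "é" in Windows-1252 but an incomplete sequence in UTF-8.
raw = "Jos\xE9".b # .b gives a binary (ASCII-8BIT) copy

# force_encoding only relabels the bytes; nothing is converted.
relabeled = raw.dup.force_encoding('UTF-8')
puts relabeled.valid_encoding?     # false: the bytes are still Windows-1252
puts relabeled.bytes == raw.bytes  # true: the same 4 bytes as before

# Label the bytes with their real encoding first, then transcode.
converted = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
puts converted.valid_encoding?     # true
puts converted.bytesize            # 5: "é" is now the two bytes 0xC3 0xA9
```

The second half is the same pattern as the snippet above: tell Ruby what the bytes really are, then ask for a conversion.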
In general, /[[:space:]]/ should capture more kinds of whitespace than /\s/ (which is equivalent to /[ \t\r\n\f]/), but it doesn't appear to be necessary in this case. I can't find any abnormal whitespace in s at this point. If you're still having problems, you'll need to post your code and a more precise description of the issue.
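As a generic illustration of how the two patterns differ, here is a small sketch using a non-breaking space (U+00A0), one kind of whitespace that falls outside /\s/. The sample string is invented for the example and is only an assumption about what such text might look like.

```ruby
# Hypothetical sample: two NO-BREAK SPACE characters between "15." and "16".
nbsp   = "\u00A0"
sample = "15.#{nbsp}#{nbsp}16"

puts sample.match?(/\s/)          # false: \s only covers ASCII whitespace
puts sample.match?(/[[:space:]]/) # true: the POSIX class is Unicode-aware

# Collapse any run of whitespace, including non-breaking spaces.
cleaned = sample.gsub(/[[:space:]]+/, ' ')
puts cleaned                      # 15. 16
```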
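Update: Thanks for updating your question with your code and an example of the problem. It looks like the issue is non-breaking spaces: the page uses &nbsp; entities, which Nokogiri decodes to the Unicode non-breaking space (U+00A0), and /\s/ doesn't match that character. I think it's simplest to get rid of them at the source: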
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
s = URI.open(URL).read   # Separate these three lines to convert &nbsp;
s.gsub!('&nbsp;', ' ')   # to normal ' ' in source rather than after
html = Nokogiri.HTML(s)  # conversion to unicode non-breaking space
# Extract Paragraphs
text = ''
html.css('p').each do |p|
text += p.text
end
# Clean Up Text
text.gsub!(/\s+/, ' ')
puts text
There's now just a single, normal space between the period at the end of 15 and the number 16:
15) Besó también José a todos sus hermanos, orando sobre cada uno de ellos; después de cuyas demostraciones cobraron aliento para conversar con él. 16 Al punto corrió la voz, y se divulgó generalmente esta noticia en el palacio del rey: Han venido los hermanos de José; y holgóse de ello Faraón y toda su corte.