I can't remove whitespaces from a string parsed by Nokogiri

Question 1

strip only removes ASCII whitespace and the character you've got here is a Unicode non-breaking space.

Removing the character is easy. You can use gsub by providing a regex with the character code:

gsub(/\u00a0/, '')

You could also call

gsub(/[[:space:]]/, '')

to remove all Unicode whitespace. For details, check the Regexp documentation.

Question 2

If I wanted to remove non-breaking spaces "\u00A0" AKA   I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML("&nbsp;")

s = doc.text # => " "

# s is the NBSP
s.ord.to_s(16)                   # => "a0"

# and here's the translate changing the NBSP to a SPACE
s.tr("\u00A0", ' ').ord.to_s(16) # => "20"

So tr("\u00A0", ' ') gets you where you want to be and at this point, the NBSP is now a space:

tr is extremely fast and easy to use.

An alternate is to pre-process the actual encoded character " " before it's been extracted from the HTML. This is simplified but it'd work for an entire HTML file just as well as a single entity in the string:

s = "&nbsp;"

s.gsub('&nbsp;', ' ') # => " "

Using a fixed string for the target is faster than using a regular expression:

s = "&nbsp;" * 10000

require 'fruity'

compare do
  fixed { s.gsub('&nbsp;', ' ') }
  regex { s.gsub(/&nbsp;/, ' ') }
 end

# >> Running each test 4 times. Test will take about 1 second.
# >> fixed is faster than regex by 2x ± 0.1

Regular expressions are useful if you need their capability, but they can drastically slow code.