I tried to reproduce your error and created this example:
require 'nokogiri'
html = Nokogiri::HTML(<<-html
<td width='400' valign=top>
<b><u>Jenny ID:</u>&nbsp;8675309</b><br />
Name of Place<br />
Street Address<br />
City, State, Zip<br />
Contact: Jenny Jenny<br />
Phone: 867-5309<br />
Fax:
</td>
html
)
el = html.css('b').first
txt = el.content.split(':').last
puts txt # ' 8675309'
p txt #"\u00A08675309"
p txt.strip #"\u00A08675309"
The leading character is not a space but \u00A0
(the Unicode character 'NO-BREAK SPACE', U+00A0). strip
does not remove it, because it only removes ASCII whitespace and null characters.
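To confirm which character you are actually dealing with, you can inspect the code points of the string. This is only a small sketch; the literal assigned to txt is a stand-in I made up for the value Nokogiri returned above, so it runs without Nokogiri.

```ruby
# Stand-in for the text extracted above (assumption: NBSP followed by the digits).
txt = "\u00A08675309"

# codepoints returns the Unicode code point of every character as an Integer.
first = txt.codepoints.first

# Format the first code point in the usual U+XXXX notation.
label = format('U+%04X', first)

puts label                 # U+00A0
puts txt.start_with?(' ')  # false, it is not an ordinary space
```

If the label shows U+0020 instead, the character is a normal space and strip should already work.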
If you remove the no-break space explicitly, you get the result you want. If you replace \u00A0
with ' '
(a normal space), strip then removes it from the ends, while whitespace inside the string is kept.
Code:
p txt.gsub("\u00A0", ' ').strip #-> "8675309"
Alternatively, you can use (thanks to mu is too short):
p txt.gsub(/\p{Space}/, ' ').strip
This requires UTF-8 encoded code. Without it you may get an Encoding::CompatibilityError.
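If you need this in several places, the same idea can be wrapped in a small helper that trims Unicode whitespace only at the ends and leaves the inside of the string untouched. This is a minimal sketch, not part of Ruby or Nokogiri: the method name unicode_strip is my own invention, and the sample string is again a stand-in for the extracted text.

```ruby
# Hypothetical helper (not a built-in): strip Unicode whitespace,
# including NO-BREAK SPACE (U+00A0), from both ends of a string.
def unicode_strip(str)
  # \A anchors at the start of the string, \z at the very end;
  # \p{Space} matches Unicode whitespace, which includes U+00A0.
  str.gsub(/\A\p{Space}+|\p{Space}+\z/, '')
end

# Stand-in for the text extracted above (assumption: NBSP followed by the digits).
txt = "\u00A08675309"

stripped = txt.strip          # plain strip leaves the NBSP in place
cleaned  = unicode_strip(txt) # the helper removes it

p stripped == txt  # true
p cleaned          # "8675309"
```

Unlike the gsub variants above, this does not turn no-break spaces inside the string into normal spaces; whether you want that depends on what you do with the text afterwards.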