Question

I'm using Nokogiri to parse an HTML document. A representation of the source code which this question is based upon follows:

<td width='400' valign=top>
  <b><u>Jenny ID:</u>&nbsp;8675309</b><br />
        Name of Place<br />
        Street Address<br />
        City, State, Zip<br />
        Contact: Jenny Jenny<br />
        Phone: 867-5309<br />
        Fax: 
</td>

I'm using a couple of delimiters to retrieve the text between Jenny ID: and Name of Place. Even with #strip, I'm unable to remove the leading space.

 > returned_value.inspect
=> " 8675309\r\n                  "
 > returned_value.strip
=> " 8675309"

If I use a test string, #strip does indeed remove the leading and trailing whitespace.

 > test_string = " 11111 "
 > test_string.strip
=> "11111"

How can I completely strip out this leading space? I suspect it's the &nbsp; but I cannot get rid of it.

I promise I'm not this dumb in real life, but this problem is beating me down. It's merciless.

Thank you!

Solution

I tried to reproduce your problem and created this example:

require 'nokogiri'

html = Nokogiri::HTML(<<-html
<td width='400' valign=top>
  <b><u>Jenny ID:</u>&nbsp;8675309</b><br />
        Name of Place<br />
        Street Address<br />
        City, State, Zip<br />
        Contact: Jenny Jenny<br />
        Phone: 867-5309<br />
        Fax: 
</td>
html
)

el = html.css('b').first
txt = el.content.split(':').last
puts txt      # ' 8675309'
p txt         # "\u00A08675309"
p txt.strip   # "\u00A08675309"

The leading character is not a regular space but \u00A0, the Unicode character NO-BREAK SPACE (U+00A0), which is what &nbsp; decodes to. strip does not remove it.
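
To check this without Nokogiri, you can look at the code point of the first character. This is only a small sketch: the string literal below is typed in by hand to mimic the value from the question, it is not actual parser output.

```ruby
# Hand-written stand-in for the value from the question:
# a no-break space, the digits, then trailing CR/LF and spaces.
txt = "\u00A08675309\r\n                  "

first = txt[0]
puts format('U+%04X', first.ord)        # U+00A0
puts first == ' '                       # false

# strip removes the trailing "\r\n" and spaces, but the no-break space stays.
stripped = txt.strip
puts stripped.start_with?("\u00A0")     # true
```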

If you remove the no-break space explicitly, you get the result you want. If you replace \u00A0 with ' ' (a normal space), strip can then remove it at the ends of the string, while any no-break space inside the string becomes a normal space instead of disappearing.

Code:

p txt.gsub("\u00A0", ' ').strip   #-> "8675309"
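
If no-break spaces inside the string should be left untouched, a possible variation is to trim only the two ends. The helper name strip_unicode_space below is made up for this sketch (it is not part of Ruby or Nokogiri), and it assumes the string is UTF-8.

```ruby
# Hypothetical helper: removes whitespace, including U+00A0, from both ends only.
# [[:space:]] is Unicode-aware in Ruby regexes when matching a UTF-8 string.
def strip_unicode_space(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

edges = strip_unicode_space("\u00A08675309\r\n   ")
inner = strip_unicode_space("\u00A0Jenny\u00A0Jenny ")

p edges == "8675309"            # true
p inner == "Jenny\u00A0Jenny"   # true, the inner no-break space is kept
```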

Alternatively, you can use (thanks to mu is too short):

p txt.gsub(/\p{Space}/, ' ').strip

This requires UTF-8 encoded strings. Without that, you may get an Encoding::CompatibilityError.
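
If the text reaches you in some other encoding, one way around the mismatch is to transcode it to UTF-8 before cleaning it. The sketch below assumes, purely for illustration, that the input is ISO-8859-1, where the byte 0xA0 is the no-break space; your actual source encoding may differ.

```ruby
# Assumed input: the same text as raw ISO-8859-1 bytes (0xA0 = no-break space).
latin1 = "\xA08675309".b.force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8 first, so the string and the pattern agree on encoding.
utf8 = latin1.encode(Encoding::UTF_8)

cleaned = utf8.gsub(/\p{Space}/, ' ').strip
p cleaned == "8675309"   # true
```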

Licensed under: CC-BY-SA with attribution