Question

I have a string containing a bunch of break tags.

Unfortunately, they are irregular:

<Br> <BR> <br/> <BR/> <br /> etc.

I am using Nokogiri, but I don't know how to tell it to split the string at each break tag.

Thanks.

Solution

So to implement iftrue's response:

a = 'a<Br>b<BR>c<br/>d<BR/>e<br />f'
a.split(/<\s*[Bb][Rr]\s*\/*>/)
=> ["a", "b", "c", "d", "e", "f"]

...you're left with an array of the bits of the string between the HTML breaks.
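
If the input can start or end with a break tag, or has stray whitespace around the pieces, the plain split leaves empty or padded strings behind. Here is a minimal sketch of one way to tidy that up. The names BREAK_TAG and split_on_breaks are made up for illustration, and the pattern is a slightly different variant of the one above: it uses the /i flag instead of [Bb][Rr] and allows at most one slash.

```ruby
# Hypothetical constant: matches <br>, <BR>, <br/>, <BR />, and similar variants.
# The /i flag makes the match case-insensitive.
BREAK_TAG = /<\s*br\s*\/?\s*>/i

# Hypothetical helper: split on break tags, trim each piece,
# and drop pieces that are empty after trimming.
def split_on_breaks(html)
  html.split(BREAK_TAG).map(&:strip).reject(&:empty?)
end

p split_on_breaks('a<Br>b<BR>c<br/>d<BR/>e<br />f')
# => ["a", "b", "c", "d", "e", "f"]

p split_on_breaks('<br>first<br><br> second <BR />')
# => ["first", "second"]
```

Whether to drop empty pieces depends on the data: if consecutive breaks are meaningful (for example, as paragraph gaps), leave out the reject step.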

OTHER TIPS

If you can split on a regular expression, use the following delimiter:

<\s*[Bb][Rr]\s*\/*>

Explanation:

One left angle bracket, zero or more whitespace characters, B or b, R or r, zero or more whitespace characters, zero or more forward slashes, one right angle bracket.

To use the regex, look here:
http://www.regular-expressions.info/ruby.html
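
To make the pieces of that delimiter easier to see, here is a sketch that writes the same pattern in Ruby's extended (/x) mode, where whitespace is ignored and # starts a comment. The constant name BR_PATTERN and the sample list are made up for illustration.

```ruby
# Hypothetical constant: the delimiter from above, one component per line.
BR_PATTERN = /
  <        # one left angle bracket
  \s*      # zero or more whitespace characters
  [Bb][Rr] # the letters b and r, each in either case
  \s*      # zero or more whitespace characters
  \/*      # zero or more forward slashes
  >        # one right angle bracket
/x

# The last two samples are not break tags and should not match.
samples = ['<Br>', '<BR>', '<br/>', '<BR/>', '<br />', '<b>', '<break>']
samples.each do |tag|
  puts "#{tag.inspect} matches: #{BR_PATTERN.match?(tag)}"
end
```

Regexp#match? needs Ruby 2.4 or later; on older versions, BR_PATTERN =~ tag gives an equivalent truthy or nil result.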

Pesto's answer is 99% of the way there. However, Nokogiri also supports creating a document fragment, which doesn't wrap the text in a DOCTYPE declaration and html/body/p tags:

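require 'nokogiri'
fragment = Nokogiri::HTML::DocumentFragment.parse('<Br>this<BR>is<br/>a<BR/>text<br />string')
# Keep only the text nodes (the <br> elements are skipped) and take their content.
text = fragment.children.select(&:text?).map(&:content)
puts text
# >> this
# >> is
# >> a
# >> text
# >> string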

If you parse the string with Nokogiri, you can then walk through its nodes and ignore anything other than text nodes:

require 'nokogiri'
doc = Nokogiri::HTML.parse('a<Br>b<BR>c<br/>d<BR/>e<br />f')
text = []
doc.search('p').first.children.each do |node|
  text << node.content if node.text?
end
p text  # => ["a", "b", "c", "d", "e", "f"]

Note that you have to search for the first p tag because Nokogiri will wrap the whole thing in <!DOCTYPE blah blah><html><body><p>YOUR TEXT</p></body></html>.

Licensed under: CC-BY-SA with attribution