I'm using https://github.com/rgrove/sanitize
From the README:
Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.
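That won't work for you: a whitelist either keeps an element everywhere or removes it everywhere, and you want to drop the empty <p> <br> </p> paragraphs only at the start and end while keeping the ones between paragraphs of text. You can do that with Nokogiri directly: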
require 'nokogiri'
doc = Nokogiri::HTML(<<END_OF_HTML)
<body>
<p> <br> </p>
<p> <br> </p>
<p> Text </p>
<p> <br> </p>
<p> Text </p>
<p> Text </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
</body>
END_OF_HTML
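ps = doc.xpath '/html/body/p'

# Indexes of the first and last paragraphs that contain real text.
first_text = nil
last_text = nil

ps.each_with_index do |para, i|
  # para.text is never nil, whereas para.at_xpath('child::text()')
  # returns nil for a <p> that has no text node at all.
  next if para.text.strip.empty?
  first_text ||= i
  last_text = i
end

# Print everything from the first to the last text paragraph, inclusive.
puts ps.slice(first_text..last_text) if first_text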
--output:--
<p> Text </p>
<p> <br></p>
<p> Text </p>
<p> Text </p>
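
If it helps to see the trimming logic on its own, here is a minimal sketch that works on a plain array of strings instead of a Nokogiri node set. The helper name trim_blank_edges and the sample data are made up for illustration; they are not part of Nokogiri or Sanitize.

```ruby
# Hypothetical helper: drop blank entries from both ends of an array,
# but keep blank entries that sit between non-blank ones.
def trim_blank_edges(items)
  first = items.index { |s| !s.strip.empty? }
  return [] if first.nil? # every entry is blank

  last = items.rindex { |s| !s.strip.empty? }
  items[first..last]
end

# Made-up sample data standing in for the text of each <p>.
paragraphs = [" ", " ", "Text A", " ", "Text B", "Text C", " ", " "]
p trim_blank_edges(paragraphs)
# prints ["Text A", " ", "Text B", "Text C"]
```

The Nokogiri version above does the same thing, only with the index bookkeeping written out inside each_with_index.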