Question

I need to trim the empty elements before the first and after the last tag that has text/content. I want to control the content displayed to the client without "breaking" the visual layout.

<p> <br> </p>   ~> remove
<p> <br> </p>   ~> remove
<p> Text </p>
<p> <br> </p>   ~> keep (the only empty tag that should be preserved)
<p> Text </p>
<p> Text </p>
<p> <br> </p>   ~> remove
<p> <br> </p>   ~> remove
<p> <br> </p>   ~> remove

I'm using Sanitize, which can be passed a transformer. The documentation shows an example snippet that removes all empty elements.

To remove empty elements before any regular element, I thought I could assign a variable to control when it stops removing the empty tags:

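should_remove_empty = true
lambda { |env|
  node = env[:node]
  return unless node.elem?

  # A node has content if it holds a non-text child or non-blank text.
  has_content = node.children.any? { |c| !c.text? || !c.content.strip.empty? }

  if has_content
    should_remove_empty = false   # stop removing once real content is seen
  elsif should_remove_empty
    node.unlink
  end
}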

But to remove the trailing empty elements, I would have to iterate in reverse order, and Sanitize doesn't give me that ability.

Does anyone know how to do this, or has anyone already implemented it?

Solution

I'm using https://github.com/rgrove/sanitize

From the README:

Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.

That alone won't work for you, because whether an empty element is acceptable depends on its position: you want to keep the empty ones that sit between elements with text.

require 'nokogiri'

doc = Nokogiri::HTML(<<END_OF_HTML) 
<body>
<p> <br> </p>
<p> <br> </p> 
<p> Text </p>
<p> <br> </p> 
<p> Text </p>
<p> Text </p>
<p> <br> </p>  
<p> <br> </p> 
<p> <br> </p>
</body>
END_OF_HTML

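ps = doc.xpath '/html/body/p'

first_text = nil
last_text = nil

ps.each_with_index do |p, i|
  # p.text also covers text in nested elements and is never nil
  next if p.text.strip.empty?

  first_text ||= i   # index of the first paragraph with text
  last_text = i      # index of the last paragraph with text seen so far
end

# first_text stays nil when no paragraph contains any text
puts ps.slice(first_text..last_text) if first_text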

--output:--
<p> Text </p>
<p> <br></p>
<p> Text </p>
<p> Text </p>
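
The index scan above can also be expressed with Array#index and Array#rindex, which avoids the sentinel values and handles input with no text at all. The following is only a minimal sketch: trim_blank_edges is a hypothetical helper name (not part of Sanitize or Nokogiri), plain strings stand in for the parsed <p> nodes, and the regex-based blank check is a crude assumption for the demo.

```ruby
# Hypothetical helper: keep the span from the first to the last non-blank
# item. The block decides what counts as "blank".
def trim_blank_edges(items, &blank)
  first = items.index { |item| !blank.call(item) }
  return [] if first.nil?                      # every item is blank

  last = items.rindex { |item| !blank.call(item) }
  items[first..last]
end

# Plain strings stand in for the parsed <p> nodes in this demo.
paragraphs = [
  '<p> <br> </p>', '<p> <br> </p>',
  '<p> Text </p>', '<p> <br> </p>', '<p> Text </p>', '<p> Text </p>',
  '<p> <br> </p>', '<p> <br> </p>', '<p> <br> </p>'
]

# Demo-only blank test: drop the tags, then look for leftover text.
kept = trim_blank_edges(paragraphs) { |html| html.gsub(/<[^>]*>/, '').strip.empty? }
puts kept
```

With Nokogiri, the same helper could presumably be fed ps.to_a with a block such as { |p| p.text.strip.empty? }; the paragraphs outside the returned span are then the ones to unlink.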
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow