Finding and removing nodes is easy:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<h1>foo</h1>
<h2>bar</h2>
<p>This is some text</p>
</body>
</html>
EOT
doc.search('h1, h2').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >>
# >>
# >> <p>This is some text</p>
# >> </body></html>
I'm using search with the CSS selector 'h1, h2', which finds all <h1> and <h2> nodes and returns them as a NodeSet. A NodeSet is like an array; its remove method simply walks the NodeSet and removes every element from the document.
If you want to look inside the nodes at their text, expand the code a little:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<h1>foo</h1>
<h2>bar</h2>
<h1>baz</h1>
<p>This is some text</p>
</body>
</html>
EOT
doc.search('h1, h2').select{ |n| n.text[/\b(?:foo|bar)\b/] }.map(&:remove)
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >>
# >>
# >> <h1>baz</h1>
# >> <p>This is some text</p>
# >> </body></html>
The text method returns the text content of a node, and /\b(?:foo|bar)\b/ looks in that text for the words "foo" or "bar". select returns an Array rather than a NodeSet, so I can't use the NodeSet's remove method. Instead, I pass that Array into map, which iterates over each Node returned by select and sends remove (that is, Nokogiri::XML::Node#remove) to it. It's a little more convoluted, but gets there.
XPath selectors could look inside the node's text to replace part of the Ruby code, but they'd be pretty ugly. I prefer to keep it simple.