Question

I have a rather long text which contains some strings that are inside HTML tags (mostly h1 and h2). I need to remove those completely, which means I need a way to find text that is enclosed in certain HTML tags and then strip these away from the original text.

I tried using gsub but couldn't figure out how to construct a regex or something that makes sense.

Solution

Finding and removing nodes is easy:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<h1>foo</h1>
<h2>bar</h2>
<p>This is some text</p>
</body>
</html>
EOT

doc.search('h1, h2').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> 
# >> 
# >> <p>This is some text</p>
# >> </body></html>

I'm using search with the CSS selector 'h1, h2', which finds all <h1> and <h2> nodes and returns them as a NodeSet. A NodeSet behaves like an array; its remove method walks the set and removes every node in it from the document.

If you want to look inside the nodes at their text, expand the code a little:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<h1>foo</h1>
<h2>bar</h2>
<h1>baz</h1>
<p>This is some text</p>
</body>
</html>
EOT

doc.search('h1, h2').select{ |n| n.text[/\b(?:foo|bar)\b/] }.map(&:remove)
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> 
# >> 
# >> <h1>baz</h1>
# >> <p>This is some text</p>
# >> </body></html>

text returns the text content of a node, and /\b(?:foo|bar)\b/ checks that text for the words "foo" or "bar". select returns an Array rather than a NodeSet, so I can't call the NodeSet's remove method on the result. Instead, I chain map, which iterates over each Node returned by select and sends remove (Nokogiri::XML::Node#remove) to it. It's a little more convoluted, but it gets there.

XPath selectors could look inside the node's text to replace part of the Ruby code, but they'd be pretty ugly. I prefer to keep it simple.

Other tips

You cannot use regex to parse HTML (see "RegEx match open tags except XHTML self-contained tags"). You might want to look at an HTML parsing gem like Nokogiri:

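require 'nokogiri'

# my_html is assumed to be a String holding your markup
doc = Nokogiri::HTML(my_html)

# Collect the text inside every <h1> and <h2>
h1s = doc.css('h1').map(&:text)
h2s = doc.css('h2').map(&:text)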
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow