Automatically fixing un-closed HTML tags in Ruby
Question
I'm trying to convert HTML pages to Markdown using the reverse-markdown Ruby gem. Unfortunately it fails with:
/usr/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `rescue in parse': #<REXML::ParseException: Missing end tag for 'img' (got "td") (REXML::ParseException)
The source contains some IMG, INPUT, etc. tags which end with > instead of />.
I've tried the tidy_ffi gem:
doc = Nokogiri::HTML(TidyFFI::Tidy.new(Nokogiri::HTML(page).to_html,
  :numeric_entities => 1,
  :output_html => 1,
  :merge_divs => 0,
  :merge_spans => 0,
  :join_styles => 0,
  :clean => 1,
  :indent => 1,
  :wrap => 0,
  :drop_empty_paras => 0,
  :literal_attributes => 1).clean)
but that made no difference. Any suggestions?
Solution
reverse-markdown assumes its input is well-formed XHTML, such as a Markdown processor would produce; the stack trace shows it parsing with REXML, which is a strict XML parser. If your HTML is not well-formed, you may want to try the html2markdown gem instead. It parses with Nokogiri and is likely more robust (disclaimer: I have not used it).
OTHER TIPS
I made a gem that excerpts HTML: https://www.ruby-toolbox.com/gems/auto_excerpt. Maybe you can use it, or look at the code it uses to do this. I'm not sure whether that answers the question here, though.
Actually I just noticed you call Nokogiri::HTML twice: Nokogiri::HTML(TidyFFI::Tidy.new(Nokogiri::HTML(page).to_html
I'm not sure whether the error you're getting is coming from Nokogiri or TidyFFI, though.