Automatically fixing un-closed HTML tags in Ruby

https://stackoverflow.com/questions/12164071

28-06-2021

Question

I'm trying to convert HTML pages to Markdown using the reverse-markdown Ruby gem. Unfortunately it fails with:

/usr/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `rescue in parse': #<REXML::ParseException: Missing end tag for 'img' (got "td") (REXML::ParseException)

The source contains some IMG, INPUT, etc. tags which end with > instead of />.

I've tried the tidy_ffi gem:

require 'nokogiri'
require 'tidy_ffi'

# Parse the page, serialize it, clean it with Tidy, then parse the result again.
doc = Nokogiri::HTML(TidyFFI::Tidy.new(Nokogiri::HTML(page).to_html,
        :numeric_entities => 1,
        :output_html => 1,
        :merge_divs => 0,
        :merge_spans => 0,
        :join_styles => 0,
        :clean => 1,
        :indent => 1,
        :wrap => 0,
        :drop_empty_paras => 0,
        :literal_attributes => 1).clean)

but that made no difference. Any suggestions?

Solution

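As the stack trace shows, reverse-markdown parses its input with REXML, a strict XML parser, so it assumes well-formed XHTML of the kind a Markdown processor produces. If your HTML isn't well-formed, you may want to try the html2markdown gem instead. It parses with Nokogiri, which tolerates malformed HTML, so it is likely more robust (disclaimer: I have not used it).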
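If switching gems is not an option, another workaround is to repair the void tags before the HTML reaches the XML parser. The following is only a rough sketch using the Ruby standard library: self_close_void_tags is a hypothetical helper name (it is not part of reverse-markdown, Nokogiri or TidyFFI), and the list of void elements is an assumption based on the HTML spec. A regex is not a real HTML parser, so this will not cope with comments, scripts or other kinds of broken markup; it only rewrites tags such as IMG and INPUT that end with > instead of />.

```ruby
# HTML void elements never take a closing tag, so a strict XML parser such as
# REXML rejects them unless they are written in self-closing form.
VOID_ELEMENTS = %w[area base br col embed hr img input link meta param source track wbr].freeze

# Matches the tag of a void element, tolerating ">" inside quoted attribute
# values. Group 1 is the tag name, group 2 the raw attribute text.
VOID_TAG = %r{<(#{VOID_ELEMENTS.join('|')})(?=[\s/>])((?:"[^"]*"|'[^']*'|[^'">])*)>}i

# Hypothetical helper: rewrites <img ...> as <img ... />, and normalizes tags
# that are already self-closed (<br/>, <br />) to the same form.
def self_close_void_tags(html)
  html.gsub(VOID_TAG) do
    name  = Regexp.last_match(1)
    # Drop an existing trailing slash so it is not emitted twice.
    attrs = Regexp.last_match(2).sub(%r{\s*/\s*\z}, '').rstrip
    "<#{name}#{attrs} />"
  end
end

puts self_close_void_tags('<td><img src="a.png"><input type="text" name="q"></td>')
# => <td><img src="a.png" /><input type="text" name="q" /></td>
```

The cleaned string can then be passed to the converter in place of the raw page. Unquoted attribute values and other XML violations are left untouched, so a real HTML parser remains the safer choice for messy input.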

OTHER TIPS

~~I made a gem that excerpts html: https://www.ruby-toolbox.com/gems/auto_excerpt maybe you can use that or look at the code it uses to do this? Not sure if that answers the question here.~~

Actually, I just noticed that you call Nokogiri::HTML twice: Nokogiri::HTML(TidyFFI::Tidy.new(Nokogiri::HTML(page).to_html

I'm not sure whether the error you're getting is coming from Nokogiri or TidyFFI, though.

Licensed under: CC-BY-SA with attribution
