I have two options for you (I like the second one the most).
1. http://home.ccil.org/~cowan/XML/tagsoup
TagSoup is a SAX-compliant parser written in Java that,
instead of parsing well-formed or valid XML,
parses HTML as it is found in the wild:
poor, nasty and brutish, though quite often far from short.
TagSoup is designed for
people who have to process this stuff using
some semblance of a rational application
design. By providing a SAX interface,
it allows standard XML tools to be applied to even the
worst HTML. TagSoup also includes a command-line processor that reads
HTML files and can generate either clean HTML or well-formed XML
that is a close approximation to XHTML.
This is the tool we are using. I also mention another tool, but I am not using it myself.
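To make the "SAX interface" point concrete, here is a minimal sketch of a SAX content handler that lists element names. It is written against the standard `org.xml.sax.XMLReader` interface, so any SAX reader can be passed in. The class and method names (`SaxElementLister`, `elementNames`, `jdkReader`) are made up for illustration, and the sketch uses the JDK's own reader on well-formed input only so that it runs without extra jars. With the TagSoup jar on the classpath, I would expect `new org.ccil.cowan.tagsoup.Parser()` to be passed as the reader instead, which is what lets the same handler cope with messy HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxElementLister {

    /** Feeds the markup through the given SAX reader and records element names in document order. */
    static List<String> elementNames(XMLReader reader, String markup) {
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return names;
    }

    /** The JDK's built-in SAX reader (assumption: a stand-in here; swap in TagSoup's Parser for real-world HTML). */
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><p>text</p><input type=\"text\" /></body></html>";
        System.out.println(elementNames(jdkReader(), markup)); // prints [html, body, p, input]
    }
}
```

The handler code does not change when the reader changes, which is the practical benefit of a tool exposing SAX.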
2. http://htmlcleaner.sourceforge.net/download.php
Download the archive, unzip it, and run the jar file as shown below.
- Go to the location where you unzipped it.
- Run: java -jar htmlcleaner-2.8.jar src=http://google.com
It will correct the missing tags and print the cleaned output.
E.g., I have an HTML file with the following contents:
<table>
<tr>
<td>Wrong Table
It gives output like this:
C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>java -jar htmlcleaner-2.8.jar src=http://localhost/fun/test.html
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:RequiredParentMissing(true) at tr
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at table
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at tbody
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at tr
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at td
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head />
<body><table>
<tbody><tr>
<td>Wrong Table</td></tr></tbody></table></body></html>
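If you want to confirm that the cleaned output really is well-formed XML, you can feed it to the JDK's standard XML parser. The sketch below is only an illustration, not part of HtmlCleaner: the class name `WellFormedCheck` and its members are made up, and the two strings are copied from the broken input and the cleaned output shown above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // The broken fragment from the example above.
    static final String BROKEN = "<table>\n<tr>\n<td>Wrong Table";

    // The cleaned output from the example above.
    static final String CLEANED =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<html>\n<head />\n<body><table>\n<tbody><tr>\n"
        + "<td>Wrong Table</td></tr></tbody></table></body></html>";

    /** Parses the text with the JDK XML parser; returns null if it is not well-formed. */
    static Document parseOrNull(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(text)));
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("broken parses:  " + (parseOrNull(BROKEN) != null));
        Document doc = parseOrNull(CLEANED);
        System.out.println("cleaned parses: " + (doc != null));
        if (doc != null) {
            // The cell text survives the clean-up unchanged.
            System.out.println("td text: " + doc.getElementsByTagName("td").item(0).getTextContent());
        }
    }
}
```

The broken fragment is rejected because its tags are never closed, while the cleaned version loads into a DOM like any other XML document.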
I also tested your HTML. The output is:
C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>java -jar htmlcleaner-2.8.jar src=http://localhost/fun/test.html
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head />
<body>
<p style="margin-top: 0"> dasa </p>
<input size="1" type="text" value="a" />
</body></html>
C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>
Thanks.