Question

What's the easiest way in Java to retrieve all elements of a certain type from a malformed HTML page? I want to do something like this:

public static void main(String[] args) {
    // Read in an HTML file from disk
    // Retrieve all INPUT elements regardless of whether the HTML is well-formed
    // Loop through all elements and retrieve their ids if they exist for the element
}

Solution

HtmlCleaner is arguably one of the best HTML parsers out there when it comes to dealing with (somewhat) malformed HTML.

The HtmlCleaner documentation includes some code samples; you're basically looking for the getElementsByName() method on TagNode.

Take a look at Comparison of Java HTML parsers if you're considering other libraries.

OTHER TIPS

I've had success using TagSoup. Here's a short description from their home page:

This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
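Because TagSoup exposes a SAX interface, the asker's task can be handled by an ordinary SAX content handler. The following is a minimal sketch rather than code from the TagSoup site: the class InputIdHandler and its helper methods are hypothetical names, and the demo helper uses the JDK's strict XML parser only so that the example runs without extra libraries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records the id of every INPUT element seen in SAX events. */
class InputIdHandler extends DefaultHandler {
    private final List<String> ids = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Namespace-aware readers fill in localName; others only provide qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("input".equalsIgnoreCase(name)) {
            String id = atts.getValue("id");
            if (id != null) {
                ids.add(id); // INPUT elements without an id are skipped
            }
        }
    }

    List<String> getIds() {
        return ids;
    }

    /** Attaches the handler to any SAX XMLReader and parses the given source. */
    static List<String> collect(XMLReader reader, InputSource source) throws Exception {
        InputIdHandler handler = new InputIdHandler();
        reader.setContentHandler(handler);
        reader.parse(source);
        return handler.getIds();
    }

    /** Demo helper using the JDK's strict parser, which only accepts well-formed input. */
    static List<String> collectFromXml(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return collect(reader, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the TagSoup jar on the classpath, passing new org.ccil.cowan.tagsoup.Parser() (which implements XMLReader) as the reader argument of collect() should let the same handler work on malformed HTML read from disk.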

Check out JTidy.

JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
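Since JTidy hands back a standard org.w3c.dom.Document, collecting the ids comes down to plain DOM calls. The sketch below is an illustration under assumptions: DomInputIds and its methods are hypothetical names, element names are assumed to be lower-case in the resulting tree, and the demo helper builds the Document with the JDK's own parser so the example runs without JTidy installed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: lists the ids of all input elements in a DOM tree. */
class DomInputIds {

    static List<String> inputIds(Document doc) {
        List<String> ids = new ArrayList<>();
        // Assumes element names are lower-case in the tree being searched.
        NodeList inputs = doc.getElementsByTagName("input");
        for (int i = 0; i < inputs.getLength(); i++) {
            Element input = (Element) inputs.item(i);
            // getAttribute returns an empty string when the attribute is absent.
            String id = input.getAttribute("id");
            if (id != null && !id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    /** Demo helper using the JDK's strict parser, which only accepts well-formed input. */
    static List<String> inputIdsFromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return inputIds(doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With JTidy available, the Document would instead come from something like new org.w3c.tidy.Tidy().parseDOM(inputStream, null), after which inputIds(doc) can be applied unchanged.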

Licensed under: CC-BY-SA with attribution