Question

(I've seen similar questions, but I don't think any of them cover my specific needs, hence this one.)

I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:

  • figuring out the most prominent color in an HTML chunk
  • changing that color to some other color (hence, has to support modification of the HTML as well)
  • pruning out unwanted tags
  • fixing up the HTML so that the result is a well-formed HTML snippet

Parts of the last two are already handled by libraries such as Jericho and JTidy. 'Plugins' built on top of these would be great.

Thanks in advance!


Solution

You might want to check out TagSoup:

http://home.ccil.org/~cowan/XML/tagsoup/

OTHER TIPS

I would first tidy it into well-formed XML, then use XSLT to do a conditional deep copy (an identity transform with overriding templates) that handles the most-prominent-color, pruning, or whatever other processing you need.
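To make the XSLT idea concrete, here is a minimal sketch using only the JDK's built-in `javax.xml.transform` API. It assumes the input has already been tidied into well-formed XML by some other tool. The class name `XsltPruneSketch`, the `oldColor`/`newColor` parameters, and the choice to prune `script` elements and rewrite `color` attributes are all illustrative assumptions, not something from the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: identity transform plus two overriding templates.
class XsltPruneSketch {

    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:param name='oldColor'/>"
      + "<xsl:param name='newColor'/>"
      // Identity template: deep-copies every node and attribute by default.
      + "<xsl:template match='@*|node()'>"
      +   "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Pruning (assumed rule): drop <script> elements and their content.
      + "<xsl:template match='script'/>"
      // Colour change: rewrite color attributes whose value equals oldColor.
      + "<xsl:template match='@color'>"
      +   "<xsl:attribute name='color'>"
      +     "<xsl:choose>"
      +       "<xsl:when test='. = $oldColor'><xsl:value-of select='$newColor'/></xsl:when>"
      +       "<xsl:otherwise><xsl:value-of select='.'/></xsl:otherwise>"
      +     "</xsl:choose>"
      +   "</xsl:attribute>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an already well-formed XML string.
    static String transform(String xml, String oldColor, String newColor) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setParameter("oldColor", oldColor);
            t.setParameter("newColor", newColor);
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div><script>x()</script>"
                   + "<font color=\"red\">hi</font><p color=\"blue\">yo</p></div>";
        System.out.println(transform(xml, "red", "green"));
    }
}
```

This only covers colours stored in `color` attributes; colours inside `style` attributes or stylesheets would need extra string handling, and finding the "most prominent" colour still has to happen before the transform.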

Take a look at JTidy, a Java port of HTML Tidy. It will, depending on what options you choose, fix non-well-formed HTML and otherwise clean it up.

You'll need something else for the colour-changing part.
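For the colour part, one rough option is plain string processing with `java.util.regex`, independent of whichever parser you pick. The sketch below assumes that "most prominent" means "most frequently occurring hex colour literal" (`#rgb` or `#rrggbb`), which is my interpretation rather than anything stated in the question. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts and replaces hex colour literals in raw HTML text.
class ColorSketch {

    // Matches #rgb or #rrggbb; the lookbehind skips numeric entities such as &#123;
    private static final Pattern HEX_COLOR =
        Pattern.compile("(?<!&)#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\\b");

    // Returns the most frequent hex colour (lowercased), or empty if none is found.
    // Ties are broken arbitrarily.
    static Optional<String> mostFrequentColor(String html) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = HEX_COLOR.matcher(html);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
    }

    // Replaces every occurrence of oldColor (case-insensitive) with newColor.
    static String replaceColor(String html, String oldColor, String newColor) {
        Matcher m = HEX_COLOR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group().equalsIgnoreCase(oldColor) ? newColor : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p style=\"color:#FF0000\">a</p><font color=\"#ff0000\">b</font>"
                    + "<span style=\"color:#00f\">c</span>";
        String top = mostFrequentColor(html).orElse("none");
        System.out.println(top);
        System.out.println(replaceColor(html, top, "#00ff00"));
    }
}
```

Known limitations: it ignores named colours and `rgb(...)` values, treats `#fff` and `#ffffff` as different colours, and will also match fragment links such as `href="#abc"`, so treat it as a starting point only.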

Maybe you will find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).

Licensed under: CC-BY-SA with attribution