Java library for HTML analysis
(I've seen similar questions, but I think none of them cater to my specific needs, hence...)
I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:
- figuring out the most prominent color in an HTML chunk
- changing that color to some other color (hence, has to support modification of the HTML as well)
- pruning out unwanted tags
- fixing up the HTML so that the result is a well-formed snippet
Libraries such as Jericho and JTidy already handle parts of the last two. 'Plugins' on top of these would be great.
Thanks in advance!
I would first tidy it into valid XML, then use XSLT to do a conditional deep copy that performs whatever processing you need (finding the most prominent color, pruning tags, and so on).
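To make the "conditional deep copy" idea concrete, here is a minimal sketch using only the JDK's built-in javax.xml.transform API. It assumes the input has already been tidied into well-formed XML with no namespace (XHTML in a namespace would need prefixed match patterns). The class name PruneWithXslt, the method prune, and the choice of script and font as the unwanted tags are my own illustrations, not anything from a particular library.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: identity transform plus overrides for unwanted tags.
class PruneWithXslt {
    // The first template copies everything as-is (deep copy).
    // The other two override it: <script> is dropped with its content,
    // <font> is unwrapped so that only its children are kept.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='script'/>"
      + "<xsl:template match='font'><xsl:apply-templates select='node()'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Assumes wellFormedXml was already cleaned up by a tidy step.
    static String prune(String wellFormedXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(wellFormedXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String in = "<div><script>x()</script><font color=\"red\"><b>hi</b></font></div>";
        System.out.println(prune(in));
    }
}
```

Adding another tag to prune is one more empty template, which is the main attraction of doing it this way.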
Maybe you will find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).
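Whichever parser you end up with, the "most prominent color" part of the question probably comes down to counting color values and rewriting the winner. Below is a rough, parser-independent sketch with plain regular expressions. It treats "most prominent" as "most frequent", which is an assumption on my part, and the pattern is deliberately naive: it also matches bgcolor and background-color, and it can be fooled by ordinary text such as "color: nice". The names ProminentColor, mostFrequentColor and replaceColor are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts color values and swaps one for another.
class ProminentColor {
    // Matches color="..." attributes and "color: ..." CSS declarations.
    // Group 1 is the value: a hex code or a color keyword.
    private static final Pattern COLOR = Pattern.compile(
        "color\\s*[:=]\\s*[\"']?\\s*(#[0-9a-fA-F]{3,6}|[a-zA-Z]+)");

    // Returns the most frequent color value (lowercased), or null if none.
    static String mostFrequentColor(String html) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = COLOR.matcher(html);
        while (m.find()) {
            counts.merge(m.group(1).toLowerCase(), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Replaces only the matched color values, not every occurrence of the text.
    static String replaceColor(String html, String from, String to) {
        Matcher m = COLOR.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase(from)) {
                out.append(html, last, m.start(1)).append(to);
                last = m.end(1);
            }
        }
        return out.append(html.substring(last)).toString();
    }

    public static void main(String[] args) {
        String html = "<p style=\"color: #FF0000\">a</p>"
                    + "<font color=\"#ff0000\">b</font>"
                    + "<span style=\"color:blue\">c</span>";
        String top = mostFrequentColor(html);
        System.out.println(top);
        System.out.println(replaceColor(html, top, "green"));
    }
}
```

For anything beyond a quick experiment, the same counting is better done on the attribute and style values that a real parser hands you, so that text content is never mistaken for a color.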