Java library for HTML analysis
23-09-2019
Question
(I've seen similar questions, but I don't think any of them covers my specific needs, hence this one.)
I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:
- figuring out the most prominent color in an HTML chunk
- changing that color to some other color (so the library has to support modifying the HTML as well)
- pruning out unwanted tags
- fixing up the HTML so that the result is a well-formed HTML snippet
Parts of the last two are done by libraries such as Jericho and JTidy. 'Plugins' on top of these would be great.
Thanks in advance!
Solution
You might want to check out TagSoup.
OTHER TIPS
I would first tidy it into valid XML, then use XSLT to do a conditional deep copy, applying whatever processing you need (most prominent color, pruning, and so on) along the way.
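Here is a rough sketch of that idea using only the JDK's built-in XML and XSLT support. It assumes the tidy step has already been done (by TagSoup, JTidy, or similar), so the input is well-formed XML. The class name `XsltHtmlSketch`, the method names, the choice of `blink` as the unwanted tag, and the definition of "most prominent" as "most frequent exact value of a `color` attribute" are all my own assumptions for illustration. Colors set through CSS or `style` attributes are not handled.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the input is already well-formed XML.
class XsltHtmlSketch {

    // Identity (deep copy) template plus two overrides:
    //  - <blink> elements are unwrapped (tag dropped, children kept)
    //  - color attributes equal to $oldColor are rewritten to $newColor
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:param name='oldColor'/>"
      + "<xsl:param name='newColor'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='blink'>"
      + "<xsl:apply-templates select='node()'/>"
      + "</xsl:template>"
      + "<xsl:template match='@color'>"
      + "<xsl:attribute name='color'>"
      + "<xsl:choose>"
      + "<xsl:when test='. = $oldColor'><xsl:value-of select='$newColor'/></xsl:when>"
      + "<xsl:otherwise><xsl:value-of select='.'/></xsl:otherwise>"
      + "</xsl:choose>"
      + "</xsl:attribute>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Returns the most frequent value of any color attribute, or null if
    // there is none. On a tie, the value seen first in the document wins.
    static String mostProminentColor(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, Integer> counts = new LinkedHashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            String color = ((Element) all.item(i)).getAttribute("color");
            if (!color.isEmpty()) {
                counts.merge(color, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Runs the stylesheet over the input and returns the result as a string.
    static String rewrite(String xml, String oldColor, String newColor) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.setParameter("oldColor", oldColor);
        t.setParameter("newColor", newColor);
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p><font color=\"red\">a</font>"
                   + "<blink><font color=\"red\">b</font></blink>"
                   + "<font color=\"blue\">c</font></p>";
        String top = mostProminentColor(xml);
        System.out.println("Most frequent color: " + top);
        System.out.println(rewrite(xml, top, "green"));
    }
}
```

I did the color counting in plain DOM code rather than in XSLT because grouping and counting are clumsy in XSLT 1.0, which is what the JDK ships with. The stylesheet then only has to do the conditional copy. Note that the `blink` template unwraps the element and keeps its children; to drop the content as well, an empty template would do.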
Maybe you will find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).
Licensed under: CC-BY-SA with attribution