Question

I am working on a product where I need to parse an HTML document. I have looked at Jericho, TagSoup, Jsoup, and Crawl4J. Which parser should I use, given that the parsing will run in a multi-threaded environment scheduled by Quartz?

If 10 threads are running at a time, I need an API that consumes little memory. I read somewhere that Jericho is a text-based search API and consumes less memory. Am I right? Or should I go with another parser, and why?


Solution

Test them out and check their memory footprint. It's hard to make predictions on memory profiles without knowing and testing the HTML you're going to parse.
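One rough way to compare footprints yourself is to measure heap usage around a parse. This is a minimal sketch using only the JDK; `MemoryCheck` and the placeholder workload are made up for illustration, and you would swap in a real call to each parser under test. Note that `Runtime` heap deltas are approximate and GC timing adds noise, so run each measurement several times.

```java
public class MemoryCheck {

    /** Approximate heap bytes consumed by running the given task.
     *  GCs first to reduce noise; the result is a rough estimate only. */
    static long heapDelta(Runnable task) {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        long before = rt.totalMemory() - rt.freeMemory();
        task.run();
        long after = rt.totalMemory() - rt.freeMemory();
        return after - before;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with e.g. Jsoup.parse(html)
        // or Jericho's new Source(html) to compare real parsers.
        long delta = heapDelta(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append("<p>row ").append(i).append("</p>");
            }
        });
        System.out.println("approx bytes used: " + delta);
    }
}
```

Running this once per parser, with the same representative HTML input, gives a crude but direct answer to the "which uses less memory for my documents" question.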

FWIW, I've used Jsoup in a number of different systems and I find that it works really well. I have never noticed any rampant memory issues with it either.

OTHER TIPS

I'm using Jsoup and I'm very impressed. It's wicked fast at parsing, and its CSS-style pattern matching of content is much easier to maintain than XPath.
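To illustrate the CSS-selector style mentioned above, here is a minimal sketch using Jsoup's documented `parse`/`select` API. The HTML string and class names are made-up sample data, and the jsoup library must be on the classpath (e.g. the `org.jsoup:jsoup` Maven artifact).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDemo {
    public static void main(String[] args) {
        // Sample markup; in practice this could come from Jsoup.connect(url).get()
        String html = "<html><body>"
                    + "<div class='item'><a href='/a'>First</a></div>"
                    + "<div class='item'><a href='/b'>Second</a></div>"
                    + "</body></html>";

        Document doc = Jsoup.parse(html);

        // CSS selector: anchors inside divs with class "item"
        // (the equivalent of an XPath like //div[@class='item']//a)
        Elements links = doc.select("div.item a");
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```

Each parsed `Document` is an independent object, so in a Quartz setup each job can parse its own document without sharing state across threads.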

I tried Validator.nu's parser first, and found it very lacking. The documentation is very thin and I couldn't get it to properly execute XPaths that worked fine in Chrome.

Also, check out this question: Which HTML Parser is the best?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow