I am looking for the best Java library that takes a URL and produces an image of the web page as it would look in a browser. I tried Flying Saucer, but almost every web page breaks it: it won't even render www.google.com or yahoo.com. The only site I could get it to render is www.w3c.org!

Any thoughts on a better tool to use, or on a way to make Flying Saucer more lax about the XHTML it accepts?

Solution

Flying Saucer fails on many pages because it only accepts well-formed XHTML (see the manual), and most real-world pages are not.
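To get a feel for the kind of input that trips an XML-based renderer, here is a minimal sketch that only checks whether markup is well-formed XML, using the JDK's own parser. It is not Flying Saucer's code; the class name `XhtmlCheck` and the method `isWellFormed` are made up for illustration, and turning off external DTD loading is my own assumption to keep the check offline.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Flying Saucer.
class XhtmlCheck {
    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: do not download DTDs referenced by a DOCTYPE.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Unclosed tags, unquoted attributes, and similar problems end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Typical browser-tolerated HTML such as `<p>one<br>two</p>` is rejected by this check, while `<p>one<br/>two</p>` passes.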

But you can use an HTML library to "clean" your input and then pass the result to Flying Saucer.

Website -> "Cleaner" -> Flying Saucer

Some good and free libs are:

  1. JSoup (personal recommendation)
  2. HtmlCleaner
  3. JTidy (sometimes more strict than needed)
  4. Jericho HTML

Other tips

Maybe you can try itext.jar.

Download it from http://itextpdf.com/download.php

About HTML crawling:

Use the URL class from the Java standard library (java.net.URL). There are tons of examples of this.
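For what it's worth, a minimal sketch of reading a page with `java.net.URL` could look like the following. The class name `PageFetcher`, the 10-second timeouts, and decoding the body as UTF-8 are all my own assumptions; a real page may declare a different charset. It needs Java 9+ for `readAllBytes`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
class PageFetcher {
    /** Reads the whole resource behind the URL and returns it as a string. */
    static String fetch(URL url) {
        try {
            URLConnection connection = url.openConnection();
            // Assumed timeouts so a slow server cannot hang the caller forever.
            connection.setConnectTimeout(10_000);
            connection.setReadTimeout(10_000);
            try (InputStream in = connection.getInputStream()) {
                // Assumption: the page is UTF-8 encoded.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same method works for `http:`, `https:` and `file:` URLs, which makes it easy to try against a saved page before hitting a live site.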

About PDF conversion:

If you are using the Spring framework, you can use the AbstractPdfView class, which works through the iText API. This is my favorite example. I think you can easily make use of it.

About image conversion:

I recommend this one: http://code.google.com/p/java-html2image/

In summary:

Read the HTML from the URL → convert it via iText or java-html2image. I strongly recommend doing it yourself rather than leaving it all to one particular library.
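If you do want to try the image step yourself with nothing but the JDK, here is a rough sketch that lays out HTML in a Swing `JEditorPane` and paints it into a `BufferedImage`. This is not the java-html2image API; `HtmlSnapshot` and `render` are names I made up. Keep in mind that Swing's HTML support is roughly HTML 3.2 with very limited CSS, so modern pages will not look the way they do in a browser.

```java
import java.awt.Color;
import java.awt.Dimension;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import javax.swing.JEditorPane;

// Hypothetical helper for illustration.
class HtmlSnapshot {
    /** Lays out the HTML at the given width and paints it into an image. */
    static BufferedImage render(String html, int width) {
        JEditorPane pane = new JEditorPane();
        pane.setEditable(false);
        pane.setContentType("text/html");
        pane.setText(html);

        // Fix the width first so the preferred height reflects line wrapping.
        pane.setSize(width, 10);
        Dimension preferred = pane.getPreferredSize();
        int height = Math.max(1, preferred.height);
        pane.setSize(width, height);

        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            // White background in case the component leaves areas unpainted.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            pane.print(g);
        } finally {
            g.dispose();
        }
        return image;
    }
}
```

The result can be saved with `javax.imageio.ImageIO.write(image, "png", new java.io.File("page.png"))`. On a server without a display you will probably need `-Djava.awt.headless=true` and at least one installed font.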
