Convert broken HTML to PDF with XMLWorker using Java

https://stackoverflow.com/questions/14816423

09-03-2022

Question

I've been learning about iText and its beauty for the past few days.

I managed to convert HTML source code to PDF successfully. However, I've been wondering if it's possible to convert broken HTML (missing tags, etc.) to PDF without XMLWorker throwing an exception, just like HTMLWorker used to do. I know XMLWorker is very sensitive and only works with correctly written HTML or (X)HTML, but I am getting the HTML from a third party, so it will most likely be broken.

I would like to know if there is a way to just convert what's possible and leave the errors floating around just like a browser would do.

Solution

Use TagSoup before passing the broken HTML to iText. It will clean up the broken HTML and return valid X(HT)ML.

TagSoup implements the SAX parser interface. There are some examples of how to use it, but it lacks proper documentation.

You will probably have to serialize the XML again and write it to a file to feed it to iText; I don't know iText's interface.

Serializing a SAX stream is possible using XMLWriter. By chance it is already included with TagSoup, so you don't need to add an extra dependency.

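import java.io.StringWriter;
import java.net.URL;
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;

// Run inside a method that throws IOException and SAXException; Parser tolerates malformed HTML.
final Parser parser = new Parser();
final StringWriter writer = new StringWriter();

// XMLWriter receives the SAX events and serializes them as well-formed XML text.
parser.setContentHandler(new XMLWriter(writer));
parser.parse(new InputSource(
        new URL("http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html")
                .openConnection().getInputStream()));
System.out.println(writer.toString());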
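
If you would rather not rely on TagSoup's XMLWriter, the JDK can also serialize a SAX stream through an identity TransformerHandler. This is only a sketch of that alternative, not something TagSoup requires. The class name SaxSerializer and its helper methods are made up for illustration, and the demo uses the JDK's strict parser as a stand-in reader, since any XMLReader (including TagSoup's Parser) should be usable in its place.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: serializes whatever SAX events a reader produces into a String.
public class SaxSerializer {

    /** Parses the markup with the given reader and returns the SAX events as XML text. */
    static String serialize(XMLReader reader, String markup) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        // A TransformerHandler without a stylesheet is an identity transform:
        // it is a ContentHandler that writes the events it receives to its Result.
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return out.toString();
    }

    /** Stand-in reader for the demo: the JDK's strict, namespace-aware SAX parser. */
    static XMLReader strictReader() throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        return spf.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // With TagSoup on the classpath, new org.ccil.cowan.tagsoup.Parser()
        // could be passed here instead to accept broken HTML.
        System.out.println(serialize(strictReader(), "<p>Hello <b>world</b></p>"));
    }
}
```

The strict reader above will still reject broken input; the point is only that the serialization step does not depend on which SAX parser produced the events.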

Decide based on iText's API whether to dump writer's output to a file or pass it another way.
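
For what it's worth, here is a rough sketch of how the two pieces might fit together without a temporary file. It assumes iText 5 with the XMLWorker add-on and TagSoup are on the classpath, and it relies on XMLWorkerHelper.parseXHtml accepting a Reader. The class name BrokenHtmlToPdf, the method names and the output path are invented for the example, and I have not checked it against heavily broken real-world pages.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical glue class: TagSoup cleans the markup, XMLWorker renders it.
public class BrokenHtmlToPdf {

    /** Lets TagSoup repair the markup and returns it as well-formed XHTML text. */
    static String tidy(String brokenHtml) throws IOException, SAXException {
        Parser parser = new Parser();
        StringWriter xhtml = new StringWriter();
        parser.setContentHandler(new XMLWriter(xhtml));
        parser.parse(new InputSource(new StringReader(brokenHtml)));
        return xhtml.toString();
    }

    /** Renders XHTML text to PDF bytes in memory using XMLWorkerHelper. */
    static byte[] toPdf(String xhtml) throws IOException, DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter pdfWriter = PdfWriter.getInstance(document, out);
        document.open();
        XMLWorkerHelper.getInstance()
                .parseXHtml(pdfWriter, document, new StringReader(xhtml));
        document.close(); // fails if the markup produced no content at all
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Deliberately broken input: unclosed <p> and <b>, no closing </body> or </html>.
        String broken = "<html><body><p>Unclosed paragraph <b>and bold<p>Second paragraph";
        // "out.pdf" is an arbitrary example path.
        Files.write(Paths.get("out.pdf"), toPdf(tidy(broken)));
    }
}
```

Passing the cleaned markup as a String keeps everything in memory; for large pages, writing TagSoup's output to a file and handing iText a FileReader would work the same way.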

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow