Question

There are many APIs and SDKs out there that let developers write a PDF converter: PDFLib, TCPDF, DOMPDF, etc.

There are off-the-shelf PDF converters as well, but they don't have all the options I want. So I think maybe it's best to just write my own converter.

If you were to sit down and write an HTML-to-PDF converter yourself, approximately how long would it take? Would you have to write a whole HTML parser before getting anywhere?

The main features my application requires are custom document sizes and absolutely positioned divs containing text and images. No iframes.

Solution

Here's how you should probably think about this task: you are not so much converting HTML to PDF as writing a renderer that renders HTML to PDF.

So if you don't have the shell of an HTML renderer, there's your first step. It should take HTML in and, given a "window size", call a set of methods that you implement to render primitives (draw lines, place images, place text, place links, etc.). You will no doubt run into the issue that HTML pages have no fixed height and PDF pages do, so the renderer has to decide where content breaks across pages.
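To make the "set of methods that you implement" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `RecordingBackend` class, the `Box` type, and the `paginate` and `render` helpers are names made up for illustration, not part of any real library. It assumes absolutely positioned boxes whose y coordinate is measured from the top of the whole document, and it takes the simplest possible approach to the page-height problem: a box goes on whichever page contains its top edge.

```python
from dataclasses import dataclass


class RecordingBackend:
    """Hypothetical back end: records primitive calls instead of writing PDF."""

    def __init__(self):
        self.calls = []

    def begin_page(self, width, height):
        self.calls.append(("begin_page", width, height))

    def place_text(self, x, y, text):
        self.calls.append(("place_text", x, y, text))


@dataclass
class Box:
    """An absolutely positioned block; y is measured from the document top."""
    x: float
    y: float
    text: str


def paginate(boxes, page_height):
    """Group boxes by page and convert y to an offset within that page.

    Simplification: a box belongs to the page that contains its top edge,
    so a box straddling a page boundary is not split.
    """
    pages = {}
    for box in boxes:
        index = int(box.y // page_height)
        pages.setdefault(index, []).append((box, box.y - index * page_height))
    return [pages.get(i, []) for i in range(max(pages, default=-1) + 1)]


def render(boxes, page_width, page_height, backend):
    """Walk the paginated boxes and emit primitive calls on the back end."""
    for page in paginate(boxes, page_height):
        backend.begin_page(page_width, page_height)
        for box, local_y in page:
            backend.place_text(box.x, local_y, box.text)


backend = RecordingBackend()
render([Box(10, 100, "first"), Box(10, 900, "second")], 600, 800, backend)
for call in backend.calls:
    print(call)
```

A real back end would implement the same methods by writing PDF content instead of recording tuples, and would add the other primitives (lines, images, links). Keeping the layout code ignorant of the output format is what lets you test pagination without producing a single PDF.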

Next, you will need a decent PDF back end. By decent, I mean that it won't explode on large numbers of images, handles resources in a sane way, and so on. It should also have reasonable Unicode support, so that if you send it a Unicode string it automatically does the PDF machinations needed to render it properly and you don't have to do that work manually (and trust me, you don't want to).

And then there are links: what are you going to do with those? Ideally, you should track them and figure out whether they go to a particular sub-section of the same document (which would become a link with a goto-view action) or out onto the web (which would become a link with an open-URI action). If you're converting multiple documents, you also have to decide whether to put a base URI on the document and use relative URIs, or whether to emit file cross-links, and so on.
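The link decision can be sketched as a small classifier. This is only an illustration under assumptions: the function name `classify_link`, the `converted_docs` parameter, and the four labels it returns are hypothetical, and a real converter would map each label onto whatever link or action objects its PDF back end exposes. The parsing itself uses `urlsplit` from Python's standard library.

```python
from urllib.parse import urlsplit


def classify_link(href, converted_docs=frozenset()):
    """Return a (kind, target) pair describing how a link could be emitted.

    converted_docs is a hypothetical set of relative paths that are being
    converted in the same batch, so links to them can become cross-links.
    """
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        # Absolute URL (https:, mailto:, //host/...): open-URI action.
        return ("uri", href)
    if not parts.path and parts.fragment:
        # "#section": jump within the same document, a goto-view action.
        return ("goto", parts.fragment)
    if parts.path in converted_docs:
        # Points at another document in this batch: file cross-link.
        return ("cross-file", href)
    # Anything else stays relative and is resolved against the base URI.
    return ("relative-uri", href)


batch = {"chapter2.html"}
for href in ["#intro", "https://example.com/a", "chapter2.html#s1", "img/logo.png"]:
    print(href, classify_link(href, batch))
```

For a "goto" link you still need to know where the target ended up, so the renderer has to record the page and position of every element with an id or name as it lays them out, and resolve the links after layout is finished.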

In addition, there's the notion of navigation and document structure. In theory, you should be able to grab <H1> and the other heading tags and build an outline tree with a goto-view action for each.
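Building the outline tree from heading tags is mostly bookkeeping with a stack. The sketch below uses Python's standard `html.parser` module; the `OutlineBuilder` class and the dictionary layout of its nodes are made up for illustration. It only collects titles and nesting, on the assumption that the renderer would fill in the destination (page and position) for each node during layout.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Collect h1..h6 headings into a nested outline (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = {"level": 0, "title": None, "children": []}
        self._stack = [self.root]   # path from the root to the latest heading
        self._current = None        # heading whose text is being collected

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives as "h1".
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            level = int(tag[1])
            node = {"level": level, "title": "", "children": []}
            # Pop back to the nearest heading with a smaller level.
            while self._stack[-1]["level"] >= level:
                self._stack.pop()
            self._stack[-1]["children"].append(node)
            self._stack.append(node)
            self._current = node

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == "h%d" % self._current["level"]:
            self._current["title"] = self._current["title"].strip()
            self._current = None


def print_outline(node, depth=0):
    for child in node["children"]:
        print("  " * depth + child["title"])
        print_outline(child, depth + 1)


builder = OutlineBuilder()
builder.feed("<H1>Intro</H1><p>text</p><h2>Scope</h2><h2>Terms</h2><h1>Usage</h1>")
print_outline(builder.root)
```

Real documents skip levels (an h3 directly under an h1, say); the stack approach simply hangs the deeper heading off the nearest shallower one, which is usually what you want in a PDF outline.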

Other things that you should be aware of: the PDF model takes a resource-based approach to large document components like images, fonts, color spaces, and so on, so that they can be shared. Building your renderer with this in mind will typically produce better PDF and use less memory. If your PDF generator allows for this, you should be able to make a resource for a particular image and write it to the document (or a temp file) early, then refer to it by a resource handle when you place it on the page. Other references to the same image would use the same handle and take up no more space in the file. Fonts work the same way: if you're using particular fonts, it helps to know them up front and to have an engine that auto-subsets them as they get used.
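The "write once, refer by handle" idea looks roughly like this. It is a sketch, not any real generator's API: the `ImageResources` class, its `register` method, and the handle naming are hypothetical, and the `objects` list stands in for image data that a real back end would write into the file. Identical images are detected here by hashing their bytes, which is one possible choice; keying on the source URL would be another.

```python
import hashlib


class ImageResources:
    """Hypothetical registry that stores each distinct image only once."""

    def __init__(self):
        self._handles = {}   # content digest -> resource handle
        self.objects = []    # stand-in for image objects written to the file

    def register(self, data: bytes) -> str:
        """Return a handle for the image, storing its data on first use."""
        digest = hashlib.sha256(data).hexdigest()
        handle = self._handles.get(digest)
        if handle is None:
            handle = "Im%d" % (len(self.objects) + 1)
            self.objects.append(data)
            self._handles[digest] = handle
        return handle


resources = ImageResources()
logo = b"fake logo bytes"
first = resources.register(logo)
again = resources.register(logo)            # same bytes, same handle
other = resources.register(b"fake photo bytes")
print(first, again, other, len(resources.objects))
```

The page content then only ever mentions handles, so a logo repeated on every page costs one stored image plus a short reference per page.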

If you already have the HTML renderer and the PDF back end, then this task should take you two weeks, maybe three, assuming that both are half reasonable.

Licensed under: CC-BY-SA with attribution