what are the various approaches for generating PDFs?

https://stackoverflow.com/questions/4570457

14-10-2019
|

Question

I have an idea for an app that would take some flash content which contains graphics and images like various geometric shapes and polygons and some random images and convert them to PDF.

Also, since I envision this app to be used my multiple users I want this process to be quick and scalable. One possible solution I could think of is have a small flash client with the capability of assembling the above mentioned graphics and images. Generate some sort of XML, send it to a server running a Java process which could render the PDF using iText.

I was wondering what are the other possible ways to do it or the best practices. Technology isn't an issue; open source or commercial.

I understand that image uploads etc will take variable amount of time so consider that images are readily available. Here are the criterias in terms of what I am looking for in a solution for PDF rendering:

No constraint on the flash client because the PDF render engine.
Scalable to multiple users
Speed and Efficiency
Least amount of serialization / deserialization

I would appreciate if you could share your tech stack idea. Thanks a lot!

PS: I would appreciated if you don't get bogged down my Flash >> XML >> Java approach. I believe it to be one of the many approaches that could be taken.

Solution

If generating the PDF in the browser using Flash is an option, then consider using AlivePdf. If not, then check out XSL:FO, we use it for server side conversion to PDF.

OTHER TIPS

I believe iText generates PDFs in Java code. It may or may not use XML as its data source; POJOs will do just as well.

Another way is XSL-FO. It requires an XML data source and an XSL-FO stylesheet to transform the XML and generate a PDF. Apache's Xalan (or any other XSL-T library) can do it for you.

"Quick" and "scalable" may require more than these. Uploading a lot of images is a process that has its own timescale and optimizations that have nothing to do with PDFs.

There's pdflib for PHP, and FPDF (also for PHP).

So you're also willing to consider other clients? It sounds like you've got a kids drawing app and want to generate something that'll preserve the state of their drawing at the time.

Lets face it, XML isn't that efficient. That's not its purpose. It's both machine and human readable, validatable, etc etc.

Instead, how about a <Canvas> based web page that submitted the state of that canvas to the server in JSON (fewer bytes, and less work to build them). The server can then work in whatever the hell library/language it wants. Lots of JSON->my-language libraries floating around out there.

Your choice in PDF libraries is then limited only by what is you have installed on your server. You also said you wanted to do as little reading/writing as possible.

The most efficient possible setup would be to have a read-only partial PDF already loaded into memory tailored to minimize the impact of canvas changes (including images). Each session would dupe that partial PDF, convert the JSON to PDF graphic commands, and save the PDF.

To minimize structural changes to the PDF you'd want to use Inline Images. No new objects in the PDF means you don't need to change your cross reference table at all (until you add fonts or want to reuse an existing image). You could build the "doc info" dictionary padded with a specific amount of spaces between objects so you could fill it in without changing any byte offsets (which would force you to recompute the xref table).

You may or may not need to mess with the page size... we are just talking about one page here, right?

So the PDF would look something like...

%%PDF-1.6
<3-4 random high order bytes to convince folks that we're a binary stream>
1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[3 0 R]>>
endobj
3 0 obj
<</Type/Page/Contents 4 0 R/MediaBox[0 0 612 792]/Parent 2 0 R>>
endobj
5 0 obj
<</Type/DocInfo/Author()  --<insert big whitespace gap here>-- 
/Title() --<ditto>--
/Subject() --<ditto>--
/Keywords() --<ditto>--
/Creator(My app's Name)
/Producer(My pdf library's name)
/CreationDate(encodedDateWhenThisTemplateWasBuilt) D:YYYYMMDDHHMMSS-timeZoneOffset
/ModDate() --<another, smaller whitespace gap>--
>>
4 0 obj
<</Filter/SeveralDifferentFiltersAvailable/Length --<byte length of the stream in this file>-->>
stream

And your template stops there. You'd have a similar "end of the PDF" template that would look something like this:

endstream
endobj
xref
0 6
0000000000 65535 f 
0000000010 00000 n
0000000025 00000 n
0000000039 00000 n
0000000097 00000 n
0000000050 00000 n
trailer
<</Root 1 0 R/Size 6/Info 5 0 R>>
startxref
--<some white space>--
%%EOF

The columns of numbers at the end are all wrong. The first column is the byte offset of that particular object (and I'm not up for counting bytes just now thank you). The second column is largely irrelevant.

PDF filling app will need to know:

The byte offsets of everything you intend to fill in within the first template.
1. All the "doc info" fields, which are all optional by the way. The /Info key and the dictionary it points to are optional for that matter. You could yank 'em if you cared to.
2. the /Length key of the content stream. That needs to be the post-filter byte length of the stream itself.
How to convert the JSON into pdf drawing commands. If you wanted to cheat a bit you could use iText[Sharp]'s PdfContentByte class, use its drawing commands, and then get the finished byte stream and slap that into your PDF. Be sure you use Inline Images or this whole scheme goes right out the window. There are probably other libraries you could gut similarly if you felt the need. Or you could just read up on the PDF spec and roll your own. You'll be sticking to a fairly limited subset of PDF's content syntax.
The byte offset of the word "xref" from the start of the file. You can calculate this: LengthOfInitialTemplate + LengthOfContentStream + OffsetFromStartOf2ndTemplateTo'xref'.
The byte offset of the line below "startxref", which is where you write the aforecalculated byte offset of 'xref'

You're not going to get much more efficient than that. You'd read in your templates once. Read/calculate the byte offsets you needed once.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow