Performance iText vs. PDFBox (2014)

Question 1

My question is what the performance depends on, and whether there is a way to make PDFBox faster?

One major difference is that PDFBox always processes text glyph by glyph, while iText normally processes it chunk by chunk (a chunk being the single string parameter of a text drawing operation); that reduces the resources iText needs quite a lot. Furthermore, the event-oriented architecture of iText's text parsing puts a lower burden on resources than PDFBox's approach. PDFBox also keeps information that is not strictly required for plain text extraction available for a longer time, which costs more resources.

But the way the libraries initially load the document may also make a difference. Here you can experiment a bit: PDFBox offers not only multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually, PDDocument.loadNonSeq reads documents correctly, while PDDocument.load can be tricked into misinterpreting PDFs). All these variants may have different runtime behavior.
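To compare the loading variants in a repeatable way, a small timing harness helps. The following is only a minimal sketch using the plain JDK: the class name LoadTiming, the method averageMillis and the busyWork placeholder are made up for illustration and are not part of PDFBox or iText.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class LoadTiming {

    /** Runs the task `runs` times (after one warm-up run) and returns the average time in ms. */
    public static double averageMillis(Callable<?> task, int runs) throws Exception {
        // One untimed warm-up call so class loading and JIT compilation skew the result less.
        task.call();
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.call();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }

    /** Placeholder workload (hypothetical); stands in for "load document, extract text, close". */
    public static long busyWork(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i % 7;
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // Each entry is one variant to measure; replace the placeholders with real library calls.
        Map<String, Callable<?>> variants = new LinkedHashMap<>();
        variants.put("variant A (placeholder)", () -> busyWork(200_000));
        variants.put("variant B (placeholder)", () -> busyWork(400_000));

        for (Map.Entry<String, Callable<?>> e : variants.entrySet()) {
            System.out.printf("%s: %.3f ms%n", e.getKey(), averageMillis(e.getValue(), 5));
        }
    }
}
```

With PDFBox 1.8 on the classpath, each placeholder would be replaced by a task that loads the same file through one variant (for example PDDocument.load(file) versus PDDocument.loadNonSeq(file, null)), extracts the text and closes the document. Measure on your own documents; the numbers depend heavily on the PDFs involved.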

More about how strategies affect performance?

iText brings along a simple and a more advanced text extraction strategy. The simple one assumes that text in the page content stream appears in reading order, while the more advanced one sorts the text by position. By default the more advanced one is used. Thus, you can probably speed up iText even further by using the simple strategy. PDFBox always sorts.
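The difference between the two strategies can be illustrated with a toy model. This is not iText's actual implementation, just a sketch under simplifying assumptions: the Chunk class and both extract methods are hypothetical, and each chunk carries only its text and the position where it is drawn.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class StrategySketch {

    /** Hypothetical text chunk: a string plus the position where it is drawn on the page. */
    public static final class Chunk {
        final String text;
        final float x;
        final float y;

        public Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simple approach: trust the content stream order and just concatenate. */
    public static String extractSimple(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorting approach: order chunks top to bottom, then left to right, before concatenating. */
    public static String extractSorted(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // The PDF y axis points upwards, so a larger y means higher on the page.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                              .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        Float lastY = null;
        for (Chunk c : sorted) {
            if (lastY != null && c.y != lastY) {
                sb.append('\n'); // start a new line when the baseline changes
            }
            sb.append(c.text);
            lastY = c.y;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Chunks deliberately listed out of reading order, as a content stream may do.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("world", 60f, 700f),
                new Chunk("Hello ", 10f, 700f),
                new Chunk("Bye", 10f, 680f));
        System.out.println(extractSimple(chunks));
        System.out.println(extractSorted(chunks));
    }
}
```

The simple variant is a single pass, while the sorting variant has to collect and sort all chunks of a page first, which is where the extra cost comes from. In iText 5 the strategy is chosen by passing it as the third argument, e.g. PdfTextExtractor.getTextFromPage(reader, pageNumber, new SimpleTextExtractionStrategy()). Keep in mind that the simple strategy only gives sensible output if the content stream really is in reading order.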

Question 2

In PDFBox version 2.0.12, the developers optimized PDFunctionType3.eval() by 30%, reduced the RAM requirement of COSOutputStream, and removed intermediate streams when merging files. All of this is documented in the release notes; see the link below for more information:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343489&styleName=Html&projectId=12310760&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED%7Cddb31610c9c60486ac6cc58a5800069ddf68ccd5%7Clout