Free PDF manipulation library or code?

https://stackoverflow.com/questions/12872878

07-07-2021

Question

I'm thinking of developing a tool for commercial use (I intend to sell it) that will manipulate document files.

The manipulations will include:

1. Concatenating several PDF files into one.
2. Converting a doc/docx file into a PDF file.
3. Splitting a single PDF file into two separate PDF files.
4. Numbering the pages of a PDF file (with a sequentially running number).

To that end, I'm looking for a free library or code to help me with the PDF manipulations. I'd prefer a C# library because my software will be written in C# (it has a GUI), but I can manage with a Java library too.

I found the "pdftk" library, which could help me a lot, but unfortunately its license doesn't allow commercial use.

Does anyone have an idea of a free library or code which can help me with that?

Thanks a lot!!

Solution

If you want to manipulate PDFs with Java, PDFBox is a good choice.

You can also take a look at iText (itextpdf), which supports both Java and C#. There is a community version of the library.

OTHER TIPS

Take a look at pdftotext at http://www.foolabs.com/xpdf/download.html.

It provides an option for extracting the contents of a PDF file into a text file. Where it stands out in comparison to other libraries is that it can maintain the layout of the PDF file in the extracted text file (this is what its -layout option is for). This is really helpful when your PDF contains structured data such as tables and the PDF files are untagged. PDFBox and other libraries fail to maintain the structure of the contents of your PDF while parsing it.

Once you have the text file extracted from your PDF, you are free to use your favorite programming language to parse the text file.
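As a rough illustration of that parsing step, here is a minimal Java sketch that turns layout-preserved lines into table rows. It assumes columns in the extracted text are separated by runs of two or more spaces, which is only a heuristic and may not hold for every PDF. The class and method names (LayoutTextParser, splitColumns, parseTable) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LayoutTextParser {

    // Splits one line of layout-preserved text into cells, treating any run of
    // two or more spaces as a column gap (an assumed heuristic, not a rule
    // guaranteed by pdftotext).
    public static List<String> splitColumns(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split(" {2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    // Parses every non-blank line of the extracted text into a row of cells.
    public static List<List<String>> parseTable(List<String> lines) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(splitColumns(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample of what a table might look like after extraction.
        List<String> sample = Arrays.asList(
                "Item          Qty    Price",
                "",
                "Blue widget     2     9.50");
        for (List<String> row : parseTable(sample)) {
            System.out.println(row);
        }
    }
}
```

Single spaces inside a cell ("Blue widget") are kept together, so the heuristic breaks down only when two columns are separated by a single space.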

Take a look at the license policy here: http://www.glyphandcog.com/Xpdf.html. According to that page, if you use the executables directly, without modifying the source code, you are free to redistribute your application that uses them. If performance is not a concern, you don't need to touch their source code.
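Calling the unmodified executable from your own program could look roughly like the Java sketch below. It assumes a pdftotext binary is installed and reachable on the PATH; the class name PdfToTextRunner and its methods are invented for illustration. The -layout flag asks pdftotext to keep the physical layout of the page.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class PdfToTextRunner {

    // Builds the command line. "pdftotext" is assumed to be on the PATH;
    // -layout asks it to preserve the physical layout of the page.
    public static List<String> buildCommand(Path pdf, Path txt) {
        return Arrays.asList("pdftotext", "-layout", pdf.toString(), txt.toString());
    }

    // Starts the executable, waits for it to finish, and returns its exit code.
    public static int extract(Path pdf, Path txt) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, txt))
                .inheritIO() // forward pdftotext's own messages to our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToTextRunner <input.pdf> <output.txt>");
            return;
        }
        int exitCode = extract(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("pdftotext exited with code " + exitCode);
    }
}
```

Each call starts a new process, so this approach pays a startup cost per PDF.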

If performance is a concern, you can create a trial version of your application that highlights the functionality but is naturally slow, as it will run the executable every time you want to process a PDF. The paid version can call the pdftotext API directly and will be faster. You can make up for the money spent on licensing very easily. I would have done this if I were you, but I already have some big projects on my plate at the moment :)

I can vouch for pdftotext as I have used it myself. All other libraries seem to forget that users may be interested in keeping the formatting of the PDF file as it is.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow