Convert to PDF/A and check compliance under Linux [closed]

https://stackoverflow.com/questions/464539

19-08-2019

Question

I am working on an online portal where researchers can upload their research papers. One requirement is that all PDFs are stored in PDF/A format. Since I can't rely on the users to generate PDF/A-conforming documents, I need a tool to check standard PDFs and convert them to PDF/A.

What is the best tool you know of?

Price
Quality
Speed
Available APIs

Open-source tools would be preferred, but a search revealed none. iText can create PDF/A, but converting isn't easy: you have to read every page and copy it to a new document, losing all bookmarks and annotations in the process. (At least as far as I know; if you know of an easy solution, let me know.)

An API should be available for PHP or Java, or a command-line tool should be provided. Please do not list GUI-only or online-only solutions.

Solution

I am not sure all your goals can be satisfied at the same time. The story around PDF/A is a lot more complex than format conversions like TIFF to PNG.

The base format of PDF/A-1 is PDF 1.4. What should happen to documents in a later version that use features introduced after 1.4? Information might be lost.
In both PDF/A-1a and 1b, metadata in XMP/RDF format is mandatory. If the original document is without metadata, you'll have to get it from somewhere and add it. At least iText can do that.
There are lots of little details to get right, from embedding fonts to making sure spaces are present instead of only horizontal movement commands.
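Two of the points above can at least be screened for cheaply before a file goes to a real validator or converter. The following is a minimal sketch, not a PDF/A validator: it reads the `%PDF-M.N` header to flag files later than PDF 1.4, and it looks for a `pdfaid:part` entry in the XMP packet to see whether the file even claims to be PDF/A. The class name `PdfPrecheck` and its methods are hypothetical, and the checks are heuristics: the header version can be overridden by a `/Version` entry in the document catalog, and a `pdfaid:part` claim says nothing about actual conformance.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a cheap pre-check, NOT a PDF/A validator.
final class PdfPrecheck {

    // Matches the "%PDF-M.N" header comment near the start of a PDF file.
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    // Matches pdfaid:part written either as an XML attribute or as an element.
    private static final Pattern PDFA_PART =
            Pattern.compile("pdfaid:part\\s*(?:=\\s*[\"']|>)\\s*(\\d)");

    /** Returns the header version as {major, minor}, or null if no header is found. */
    static int[] headerVersion(byte[] content) {
        // Only the first 1024 bytes are inspected; ISO-8859-1 maps bytes 1:1 to chars.
        int len = Math.min(content.length, 1024);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    /** True if the version is later than 1.4, the base version of PDF/A-1. */
    static boolean newerThanPdf14(int[] version) {
        return version[0] > 1 || (version[0] == 1 && version[1] > 4);
    }

    /** Returns the PDF/A part claimed in the XMP metadata, or -1 if no claim is visible. */
    static int claimedPdfAPart(byte[] content) {
        String text = new String(content, StandardCharsets.ISO_8859_1);
        Matcher m = PDFA_PART.matcher(text);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfPrecheck <file.pdf>");
            return;
        }
        // Reads the whole file into memory, which is fine for a sketch.
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        int[] v = headerVersion(content);
        if (v == null) {
            System.out.println("No %PDF header found");
            return;
        }
        System.out.println("Header version: " + v[0] + "." + v[1]);
        System.out.println("Later than PDF 1.4: " + newerThanPdf14(v));
        int part = claimedPdfAPart(content);
        System.out.println(part < 0 ? "No PDF/A claim found in metadata"
                                    : "Claims PDF/A part " + part);
    }
}
```

A file that fails either check certainly needs conversion or a closer look; a file that passes both still has to go through a proper validator.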

To sum it all up: I think you are better off placing some or all of the responsibility for compliance on the producers of the PDFs. Of course, that doesn't mean you can't help them: if you figure out which tools the majority use to create their papers, you can point them to documentation about PDF/A for those specific tools. (As a somewhat extreme example of such documentation, have a look at this.)

Good luck with your efforts.

OTHER TIPS

I used to work for the French National Library, building an archive system that did this kind of thing. Like most of the top-ten libraries in the world, we used JHOVE to recognize file formats.

JHOVE can tell whether files are PDF/A or not, and it can even validate them. It also knows 7 other kinds of PDF; see the details.

JHOVE is open source and is maintained by JSTOR and the Harvard University Library. It is rather simple to use.
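Since the question asks for something callable from Java or the command line, here is a rough sketch of driving the JHOVE command-line tool from Java and reading its text report. It makes several assumptions: that a `jhove` launcher is on the PATH, that the PDF module is selected with `-m PDF-hul`, and that the default text report contains lines such as `Status: Well-Formed and valid` and `Profile: ISO PDF/A-1, Level B`. Check these against the output of your installed version. The class name `JhoveCheck` and its methods are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the JHOVE command-line tool.
final class JhoveCheck {

    /** Builds the command line; assumes a "jhove" launcher is on the PATH. */
    static List<String> command(String pdfPath) {
        return Arrays.asList("jhove", "-m", "PDF-hul", pdfPath);
    }

    /** Returns the value of the first "key: value" line in the report, or null. */
    static String field(List<String> reportLines, String key) {
        String prefix = key + ":";
        for (String line : reportLines) {
            String trimmed = line.trim();
            if (trimmed.startsWith(prefix)) {
                return trimmed.substring(prefix.length()).trim();
            }
        }
        return null;
    }

    /** True if the report says the file is valid and lists a PDF/A profile. */
    static boolean looksLikeValidPdfA(List<String> reportLines) {
        String status = field(reportLines, "Status");     // assumed report line
        String profile = field(reportLines, "Profile");   // assumed report line
        return "Well-Formed and valid".equals(status)
                && profile != null
                && profile.contains("PDF/A");
    }

    /** Runs JHOVE on one file and returns its output, one entry per line. */
    static List<String> run(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdfPath))
                .redirectErrorStream(true)
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java JhoveCheck <file.pdf>");
            return;
        }
        List<String> report = run(args[0]);
        System.out.println(looksLikeValidPdfA(report)
                ? "JHOVE reports a valid PDF with a PDF/A profile"
                : "JHOVE does not report a valid PDF/A file");
    }
}
```

Keeping the report parsing separate from the process call means the parsing can be tested against saved reports without JHOVE installed.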

For the identification part you could try the DROID tool (Digital Record Object Identification), which provides access to the PRONOM technical registry (which includes PDF/A).

The OpenOffice API project might be what you're looking for. As of version 2.4, OpenOffice supports PDF/A documents. Here is a code example from the website on how to convert documents; the example is in Java.

I am not sure about PDF/A documents, but have you looked at JODConverter? It can convert between many different formats, and it is open source. We use it quite extensively in our project.

http://www.artofsolving.com/opensource/jodconverter

Licensed under: CC-BY-SA with attribution