Question

I'm currently working with the new German ZUGFeRD files. These are PDF/A-3 files that have an embedded XML file containing the invoice data.

I want to extract this XML file from the PDF/A-3 using ABCpdf 8.1 with C#.

Any idea how to do this?

Thanks a lot and regards,


Solution

I don't know ABCpdf, but I assume most PDF libraries offer similar access to a PDF's content.

First take a look at Das-ZUGFeRD-Format_1p0.pdf, especially page 112. The diagram there shows the object tree you have to traverse in order to find the XML stream.

The tree gives you the names, the types, and the direction of the references. With it you can traverse the PDF object tree to reach the XML content you are looking for.

The steps, based on the diagram:

  1. Read your PDF
  2. Get the catalog inside your PDF
  3. Get the array named AF from the catalog
  4. Get the first element of the AF array (it should be a file specification)
  5. From the file specification, get the dictionary named EF
  6. From the EF dictionary, get the stream referenced by its F entry and read the stream content

These are the steps you need to perform in order to get to the content.
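As a rough, library-neutral illustration of that walk, here is a minimal Python sketch. It assumes the PDF library has already resolved indirect references into plain nested dictionaries and lists; the `"data"` key and the function name `find_embedded_xml` are made up for this example and are not part of ABCpdf or any other library.

```python
import zlib


def find_embedded_xml(catalog: dict) -> bytes:
    """Follow Catalog -> /AF -> [0] -> /EF -> /F and return the decoded bytes."""
    af = catalog["/AF"]      # array of associated file specifications
    filespec = af[0]         # first (and for ZUGFeRD, only) file specification
    ef = filespec["/EF"]     # dictionary holding the embedded file stream(s)
    stream = ef["/F"]        # the embedded file stream itself
    data = stream["data"]    # hypothetical key: the raw stream bytes
    if stream.get("/Filter") == "/FlateDecode":
        data = zlib.decompress(data)  # undo Flate compression
    return data


# Stand-in object tree with placeholder XML, only to exercise the function.
sample_xml = b"<?xml version='1.0'?><invoice/>"
sample_catalog = {
    "/Type": "/Catalog",
    "/AF": [
        {
            "/Type": "/Filespec",
            "/EF": {
                "/F": {
                    "/Type": "/EmbeddedFile",
                    "/Filter": "/FlateDecode",
                    "data": zlib.compress(sample_xml),
                }
            },
        }
    ],
}

print(find_embedded_xml(sample_catalog))
```

With a real library, each dictionary lookup above becomes that library's way of reading a named entry and dereferencing it.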

To display the structure of a PDF and browse its object tree, I would recommend a tool like iText RUPS.

OTHER TIPS

What I did with ABCpdf:

  • Get the ObjectSoup array from the Doc (pretty much an array of all objects in the document)

  • As ZUGFeRD allows only one embedded file inside the PDF, I just searched this ObjectSoup array for the one object of type StreamObject that contains /EmbeddedFile

  • Decompress the stream of that object, get the byte[] of the stream, and write it into an XML file
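The same "scan every object for the embedded file stream" idea can be sketched without ABCpdf. The Python below is a naive illustration, not the ABCpdf API: it searches the raw file bytes with a regular expression, and it assumes an unencrypted PDF whose embedded stream dictionary carries /Type /EmbeddedFile and uses either no filter or /FlateDecode. The names `extract_embedded_files` and `invoice.pdf` are made up for the example.

```python
import re
import zlib

# One stream object: "obj << ...dictionary... >> stream <data> endstream".
# The (?!endobj) guard keeps a match from running across object boundaries.
_STREAM_OBJ = re.compile(
    rb"obj\s*(<<(?:(?!endobj).)*?>>)\s*stream\r?\n(.*?)endstream",
    re.DOTALL,
)
# \b rejects the similarly named /EmbeddedFiles key of the name tree.
_EMBEDDED = re.compile(rb"/Type\s*/EmbeddedFile\b")


def extract_embedded_files(pdf: bytes) -> list[bytes]:
    """Return the decoded content of every embedded file stream found."""
    results = []
    for match in _STREAM_OBJ.finditer(pdf):
        head, raw = match.group(1), match.group(2)
        if not _EMBEDDED.search(head):
            continue  # some other stream (page content, font, image, ...)
        if b"/FlateDecode" in head:
            # decompressobj tolerates the end-of-line left before "endstream"
            raw = zlib.decompressobj().decompress(raw)
        else:
            raw = raw.rstrip(b"\r\n")
        results.append(raw)
    return results


# Possible usage (file names are placeholders):
# with open("invoice.pdf", "rb") as f:
#     xml = extract_embedded_files(f.read())[0]
# with open("invoice.xml", "wb") as out:
#     out.write(xml)
```

A real PDF library is the safer route, since it handles encryption, other filters, and unusual layouts that this byte scan ignores.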

Licensed under: CC-BY-SA with attribution