PDF and “massive” XMP data storage

https://stackoverflow.com/questions/12315883

30-06-2021
Question

So I have a program that creates an output PDF file which I want to make readable (by my program) by embedding metadata into it. And that is quite a lot of data.

It was suggested to me to do this using the XMP format. However, I'm not sure that is going to work.

If you don't feel like reading all this, skip to the Last paragraph(s); if you don't understand the questions, return here...

My file could have structure like this:

Heading1
<indent>1.Question
<indent><indent>a)answer
<indent><indent>b)answer
<indent>2.Question
<indent><indent>a)answer
<indent><indent>b)answer
<indent><indent>c)answer
<indent>3.Question
<indent>4.Question
Heading2
<indent>1.Question
<indent>2.Question
<indent><indent>a)answer

Every question has its parent heading and every answer has its parent question. A file like this could have an unlimited number of headings, an unlimited number of questions per heading, and zero to five answers per question.

In order for my program to be able to assemble the same file in its GUI, it requires several pieces of information.

It needs to know:

number of headings (integer)
heading type (boolean) (heading doesn't have to contain only questions, so this is needed, but I omitted the other type of heading in example to simplify the matter)
string containing the text in each heading/question/answer

Following the example, this is what my readable file could look like:

2                  //heading number
Q/4/headingText    //type of heading/number of question/content
2/questionText     //number of answers/content
answerText         //content
answerText         //etc...
3/questionText
answerText
answerText
answerText
0/questionText
0/questionText
Q/2/headingText
0/questionText
1/questionText
answerText

This works if I assume that the file is read line by line. The first line tells how many headings to expect; the second line (and every heading line) tells the heading type and how many questions to expect before the next heading. Question lines tell how many succeeding lines contain answer content. Answer lines contain only content.
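To make the reading rules above concrete, here is a minimal sketch of such a line-by-line reader in Java. The class and field names (`SaveFileParser`, `Heading`, `Question`) are made up for illustration, and the sketch assumes the `//` annotations from the example are not part of the real file and that no heading, question, or answer text contains a line break.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model and reader for the line-based format described above.
final class SaveFileParser {

    static final class Question {
        final String text;
        final List<String> answers = new ArrayList<>();
        Question(String text) { this.text = text; }
    }

    static final class Heading {
        final String type;   // e.g. "Q" for a heading that contains questions
        final String text;
        final List<Question> questions = new ArrayList<>();
        Heading(String type, String text) { this.type = type; this.text = text; }
    }

    static List<Heading> parse(List<String> lines) {
        Iterator<String> it = lines.iterator();
        // First line: number of headings to expect.
        int headingCount = Integer.parseInt(it.next().trim());
        List<Heading> headings = new ArrayList<>();
        for (int h = 0; h < headingCount; h++) {
            // Heading line: type/questionCount/text.
            // The limit of 3 keeps any '/' that appears inside the text itself.
            String[] head = it.next().split("/", 3);
            Heading heading = new Heading(head[0], head[2]);
            int questionCount = Integer.parseInt(head[1].trim());
            for (int q = 0; q < questionCount; q++) {
                // Question line: answerCount/text.
                String[] parts = it.next().split("/", 2);
                Question question = new Question(parts[1]);
                int answerCount = Integer.parseInt(parts[0].trim());
                for (int a = 0; a < answerCount; a++) {
                    // Answer lines carry only content.
                    question.answers.add(it.next());
                }
                heading.questions.add(question);
            }
            headings.add(heading);
        }
        return headings;
    }
}
```

Because every line announces how many lines follow it, the reader never needs to look ahead; a malformed file simply fails with a `NumberFormatException` or `NoSuchElementException`.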

All this is to illustrate what I need of my "save file".

Last paragraph(s)

Is all that possible with XMP? Can I read properties line by line and have a property with multiple values attached to it, or at least somehow divide it into a couple of properties in a way that keeps this functionality?

And the most important question is, could XMP readers/writers (iText) handle the non-fixed size of the XMP file?

My alternative is to simply attach those lines somewhere at the end of the PDF file (so as not to mess up the cross-reference table) and comment them out (using %), then create a special reader in Java that would look for and parse those lines.

Solution

This is how I interpret your question.

You want to create a PDF that is readable by humans and that renders header texts, questions, and possible answers.

At the same time, you want the PDF to be readable by a program that doesn't know anything about PDF. The content read by the program differs from the content read by humans in that it has some structure.

I don't see the link with PDF. I would store the data you want to be machine-readable as an attachment to the PDF, and have your program extract that attachment. If your program can use iText, then it's a piece of cake. If your program can only read bytes, then you could try different options:

(1) Store the data as a stream that isn't compressed. Find the uncompressed stream by adding some kind of long, recognizable string as the first line of the data (that's more or less how an XMP stream is detected by software that can't interpret PDF syntax).
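As a rough illustration of option (1), the following Java sketch scans the raw bytes of a file for a marker and returns everything between the marker and the next `endstream` keyword. The marker value and the class name `MarkerScanner` are invented for this example. It also assumes that the PDF is not encrypted, that the data is UTF-8 text, and that the data never contains the word `endstream` itself.

```java
import java.nio.charset.StandardCharsets;

// Sketch: locate an uncompressed stream by a recognizable first line.
final class MarkerScanner {

    // Hypothetical marker; any long string unlikely to occur elsewhere would do.
    static final String MARKER = "%%MYAPP-SAVE-DATA-7f3a9c%%";

    // Naive byte-array search; returns -1 when the needle is absent.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Returns the text stored after the marker, or null if no marker is found.
    static String extract(byte[] pdf) {
        byte[] marker = MARKER.getBytes(StandardCharsets.US_ASCII);
        int start = indexOf(pdf, marker, 0);
        if (start < 0) return null;
        start += marker.length;
        // The stream data ends where the PDF keyword "endstream" begins.
        int end = indexOf(pdf, "endstream".getBytes(StandardCharsets.US_ASCII), start);
        if (end < 0) return null;
        // trim() drops the line breaks around the payload.
        return new String(pdf, start, end - start, StandardCharsets.UTF_8).trim();
    }
}
```

The file contents could be loaded with `java.nio.file.Files.readAllBytes` and the result split into lines for whatever parser reads the save format.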

(2) Store the data as a compressed stream, but add an extra entry to the stream dictionary of the compressed stream. Loop over the objects in the PDF file, look for a stream dictionary with that specific, custom key/value pair, read the stream, and decompress it.
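Option (2) could look roughly like the sketch below, which uses only `java.util.zip.Inflater` from the standard library. The key name `/MyAppSaveData` and the class name `TaggedStreamReader` are hypothetical. Instead of properly looping over PDF objects, the sketch takes a shortcut and searches the raw bytes for the key; it further assumes the stream uses only the Flate filter, that the PDF is not encrypted, and that the compressed bytes do not happen to contain the sequence `endstream` (reading the dictionary's `/Length` entry would be more robust).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Sketch: find a Flate-compressed stream whose dictionary carries a custom key.
final class TaggedStreamReader {

    static final String KEY = "/MyAppSaveData"; // hypothetical custom dictionary key

    static int indexOf(byte[] haystack, String needle, int from) {
        byte[] n = needle.getBytes(StandardCharsets.US_ASCII);
        outer:
        for (int i = from; i <= haystack.length - n.length; i++) {
            for (int j = 0; j < n.length; j++) {
                if (haystack[i + j] != n[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Returns the decompressed text, or null if no tagged stream is found.
    static String extract(byte[] pdf) {
        int key = indexOf(pdf, KEY, 0);
        if (key < 0) return null;
        // The "stream" keyword follows the dictionary that contains the key.
        int keyword = indexOf(pdf, "stream", key);
        if (keyword < 0) return null;
        int start = keyword + "stream".length();
        // The keyword is followed by CRLF or LF before the data begins.
        if (start < pdf.length && pdf[start] == '\r') start++;
        if (start < pdf.length && pdf[start] == '\n') start++;
        int end = indexOf(pdf, "endstream", start);
        if (end < 0) return null;

        Inflater inflater = new Inflater(); // zlib format, as used by FlateDecode
        inflater.setInput(pdf, start, end - start);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected data; return what we have
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Tagged stream is not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Any line break between the compressed data and `endstream` is harmless here, because the inflater stops as soon as the zlib stream is complete.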

If I misinterpreted your question, please rephrase.

Licensed under: CC-BY-SA with attribution
