Detect the presence of hidden messages in a StegoPDF

Question

Answering the questions from the comments...

1) Between two lines of a paragraph, is there always the same gap (distance or space)? does the value of this gap is always integer?

No. Each text chunk (which can be as little as one character and as much as one line) can be drawn on the page starting at an arbitrary position. Even the string drawing operations which are defined to advance to the start of the next line and draw the string there can each be preceded by an operation changing the height of a line and, thus, result in lines with different distances.

And no, the advance from line to line (just like all coordinates) is given by floating point values.

You can, therefore, hide information in the advance from line to line. And as floats are used here, this hidden information does not even need to be perceivable.

2) In the same line, and when the text is justified, is there always the same gap between two words ? In other words, does the space is constant into the line? does this gap is an integer?

It is quite common in PDFs to slightly manipulate the distances between many character pairs in a line. Generally this is done for applying kerning which is not done automatically in PDFs. In such a context marginally different gaps between words are not surprising, even in case of text appearing justified.

And these gaps, also, are given as floats (or some sum of products of floats).

3) When a page contains multiples paragraphs, is there always the same gap between two paragraphs? does the value of this gap is an integer ?

As already the distance between lines of a paragraph can differ, cf. your question 1, the distance between paragraphs can differ, too, and is given as a float.

BTW, PDFs do not know the concept of paragraphs. Whether two lines belong to the same or to different paragraphs, does not make a difference in the PDF page descriptions.

4) For each character (lower or upper case), can we establish a statistics on the distance between the previous character and the next character? attention to the space.

What do you mean by this? You certainly can take a PDF (as long as it allows text extraction) and create such statistics.

PS: Including clarifications from comments:

i mean does the gap between an upper case character and lower case character is always the same ? for example : Ac & Pc ,,,does the gap between A and c remains the same as between P and c ?

To begin with, you should be aware that the only thing the PDF page description knows about a character is a width value. When arranging the characters of a single text chunk on the page, PDF reserves this very width for a character followed by a character spacing width before arranging the next character. This character spacing value can be set by means of a special operator. (These values of course are multiplied by the font size (width only), a horizontal scaling factor, and the scaling implied by the current transformation matrix and text matrix in the writing direction).

Thus, the distance between the width reserved for one character and the width reserved for the next in the characters of a single text chunk always is the current character spacing value (before scaling is applied).

In spite of this, though, the distance may appear differently if the character paintings fill their widths to a different degree. This depends on how the afore mentioned width values correspond to the the font files used. These font files may be embedded in the PDF or taken from local computer resources.

Usually the width values are chosen to generate a fairly harmonic appearance without a need to further tweak the distances. It of course is possible, though, to make these value require corrections for a pleasant appearance and to hide information here.

And when the text is justified, does originally the gap between words remains the same (before applying the Kerning process) ?

If justification is generated by means of the character spacing value mentioned above and the word spacing value (which analogously is additionally added to the space character width) and you look at gaps between words in the same chunk or in chunks generated using the same spacing values, the gaps between words (before scaling) are the same.

If justification is done somehow else, distances might differ.

For details on how the glyph displacement is calculated, this is the formula (text and transformation matrices still to be applied):

(section 9.4.4, ISO 32000-1:2008)