Question

I am trying to extract text from a PDF book and keep running into an issue where sections of copied text fail to retain the proper capitalization when pasted into a text document. I have rights to reproduce the book and also have a license to use all necessary fonts. At first I thought the issue was caused by the fonts not being embedded, but I checked and all fonts appear to be subset embedded. The PDF uses over 100 fonts, each with one of the following properties:

- TrueType, Encoding: Ansi
- TrueType (CID), Encoding: Identity-H
- Type 1 (CID), Encoding: Identity-H
- Type 1, Encoding: Custom

The languages within the book include English, German, Spanish and Italian. In German, capitalization is absolutely critical. The text tends to lose uppercase letters more often than lowercase ones.

An example of the error would be: WELD -> weld

I am really at a loss as to what to do here. I have asked the owner of the book to embed the fonts, which he has done as subsets, but the problem continues. I have tried saving the PDF as PostScript and then running it through Distiller, which corrected much of the problem, but in some cases this resulted in text being replaced with different characters, or in numbers showing up as skulls. I understand that CID fonts might be contributing to the issue, but I have come across instances where a non-CID font had the same result.

What could be causing this issue? Is it that the fonts are subset rather than fully embedded? Is there a better way to save the native file (InDesign) to a PDF that will allow for better font extraction? Does it have to do with non-Unicode fonts, and if so, is there an alternative that does not require the owner to select different fonts?

Any and all assistance is greatly appreciated.


Solution

That's indeed funny. The sample PDF provided by the OP visibly contains upper case characters, some of them in upper-case-only lines, some in mixed-case lines, which Adobe Reader nonetheless extracts as lower case characters.

You wonder

What could be causing this issue?

As an example of how that happens, let's look at Pelle Più bella.

In the page content stream that phrase actually appears in capital letters, matching the visual representation:

/T1_0 1 Tf
-0.025 Tc 12 0 0 12 379.5354 554.8809 Tm
(PELLE PI\331 BELLA)Tj

Looking at the used font T1_0 (a DIN-Bold subset), we see that it claims to use WinAnsiEncoding, which would also indicate an interpretation of those character codes in the page stream as capital letters.

But the font also has a ToUnicode mapping, and this mapping maps

<41> <0061> — 'A' → a
<42> <0062> — 'B' → b
<43> <0043> — 'C' → C
<44> <0044> — 'D' → D
<45> <0065> — 'E' → e
<49> <0069> — 'I' → i
<4C> <006C> — 'L' → l
<4D> <004D> — 'M' → M
<4E> <006E> — 'N' → n
<50> <0050> — 'P' → P
<52> <0072> — 'R' → r
<53> <0053> — 'S' → S
<54> <0074> — 'T' → t
<D9> <00F9> — 'Ù' → ù

(I only extracted the mappings from character codes which in WinAnsiEncoding represent capital letters.)
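To see the effect of such a mapping, one can simulate what a text extractor does: replace each character code from the content stream by its ToUnicode value, passing unmapped codes through unchanged. This is a minimal pure-Java sketch (the class and method names are made up for illustration; the map entries are the ones listed above):

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeDemo {
    // The ToUnicode entries listed above (character code -> Unicode value)
    static final Map<Character, Character> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put('\u0041', '\u0061'); // 'A' -> a
        TO_UNICODE.put('\u0042', '\u0062'); // 'B' -> b
        TO_UNICODE.put('\u0043', '\u0043'); // 'C' -> C
        TO_UNICODE.put('\u0044', '\u0044'); // 'D' -> D
        TO_UNICODE.put('\u0045', '\u0065'); // 'E' -> e
        TO_UNICODE.put('\u0049', '\u0069'); // 'I' -> i
        TO_UNICODE.put('\u004C', '\u006C'); // 'L' -> l
        TO_UNICODE.put('\u004D', '\u004D'); // 'M' -> M
        TO_UNICODE.put('\u004E', '\u006E'); // 'N' -> n
        TO_UNICODE.put('\u0050', '\u0050'); // 'P' -> P
        TO_UNICODE.put('\u0052', '\u0072'); // 'R' -> r
        TO_UNICODE.put('\u0053', '\u0053'); // 'S' -> S
        TO_UNICODE.put('\u0054', '\u0074'); // 'T' -> t
        TO_UNICODE.put('\u00D9', '\u00F9'); // 'Ù' -> ù
    }

    // Apply the ToUnicode map the way a text extractor would:
    // mapped codes are replaced, unmapped codes pass through.
    static String extract(String codes) {
        StringBuilder sb = new StringBuilder();
        for (char c : codes.toCharArray()) {
            sb.append(TO_UNICODE.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("PELLE PI\u00D9 BELLA")); // Pelle Più bella
    }
}
```

Running this on the capital-letter codes from the content stream yields exactly the mixed-case "Pelle Più bella" that Adobe Reader extracts.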

Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?

Sorry, I'm not really into InDesign. But that software being from Adobe, I would be surprised if this was a bug in InDesign or its PDF export. Could it instead be that there is some information in the InDesign file which tags PELLE PIÙ BELLA as Pelle Più bella, and which InDesign then translates into this ToUnicode mapping during PDF export?

Does it have to do with non-unicode fonts and if so is there an alternative that does not require the owner to select different fonts?

In the case of your sample document there are three fonts, all of them with an Encoding entry WinAnsiEncoding, all of them embedded subsets, but only two have such funny ToUnicode mappings, DIN-Medium and DIN-Bold, while Helvetica has no ToUnicode mapping at all. So it is somehow font related; how exactly I cannot say.

A workaround in the case of your sample document would be to remove the ToUnicode mapping from the font dictionaries.

For example, using Java and the iText library, you can do that like this:

import java.io.FileOutputStream;

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

// INPUT and OUTPUT are the source and target file names
PdfReader reader = new PdfReader(INPUT);
// Walk all indirect objects and strip the ToUnicode entry
// from every font dictionary
for (int i = 1; i <= reader.getXrefSize(); i++)
{
    PdfObject obj = reader.getPdfObject(i);
    if (obj != null && obj.isDictionary())
    {
        PdfDictionary dic = (PdfDictionary) obj;
        if (PdfName.FONT.equals(dic.getAsName(PdfName.TYPE)))
        {
            dic.remove(PdfName.TOUNICODE);
        }
    }
}
// Write the manipulated document
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(OUTPUT));
stamper.close();
reader.close();

After this manipulation Adobe Reader text extraction results in

PELLE PIÙ BELLA

This obviously only works in situations like the one in your sample document.

If your other documents contain a mixture of fonts, some of which require their respective ToUnicode map for correct text extraction while others are like the trouble fonts above, you might want to add some extra conditions to the Java code to only remove the map from the buggy font definitions.
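One possible extra condition, sketched here in plain Java on an already-parsed code-to-Unicode map (the class name and the criterion are assumptions for illustration, not iText API): treat a font's ToUnicode map as suspect if it sends the code of an upper case letter to that letter's own lower case form, which is exactly the pattern seen in DIN-Medium and DIN-Bold above.

```java
import java.util.HashMap;
import java.util.Map;

public class BrokenMapDetector {
    // Heuristic: a ToUnicode map "looks broken" if any upper case
    // letter code is mapped to its own lower case counterpart.
    static boolean looksBroken(Map<Integer, Integer> toUnicode) {
        for (Map.Entry<Integer, Integer> e : toUnicode.entrySet()) {
            int code = e.getKey();
            int uni = e.getValue();
            if (code >= 'A' && code <= 'Z' && uni == Character.toLowerCase(code)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> broken = new HashMap<>();
        broken.put((int) 'E', (int) 'e'); // 'E' -> 'e', as in DIN-Bold
        broken.put((int) 'P', (int) 'P'); // 'P' -> 'P'

        Map<Integer, Integer> sane = new HashMap<>();
        sane.put((int) 'E', (int) 'E');

        System.out.println(looksBroken(broken)); // true
        System.out.println(looksBroken(sane));   // false
    }
}
```

In the iText loop above you would first parse the font's ToUnicode CMap stream into such a map and only call dic.remove(PdfName.TOUNICODE) when the check fires; the parsing step is not shown here.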

Other tips

No need to jump through PDF hoops. PDF isn't even a good text interchange format to begin with.

Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?

Ask the file provider to make an RTF export. This will retain all used fonts and formatting.

Your WELD -> weld problem might be caused by the font itself (if it contains both upper and lower case mapped to the same glyphs), by the use of an OpenType feature such as All Capitals, or even by something like a badly created text-only stream inside the PDF.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow