Question

This program works OK for PDF files that I have created but I have to get bold and italic info for Stedman's Dictionary.pdf which appears to have a trick to hide this info. Any suggestions will be warmly welcome.

Note: This is a pure voluntary effort for helping some doctor friends.

    package arspdfbox;

    import java.io.*;
    import org.apache.pdfbox.exceptions.InvalidPasswordException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;
    import java.io.IOException;
    import java.util.List;

    public class PrintTextLocations extends PDFTextStripper {

        public PrintTextLocations() throws IOException {
            super.setSortByPosition(true);
        }

        public static void main(String[] args) throws Exception {

            PDDocument document = null;
            try {
                File input = new File("Stedman_Medical_Dictionary.pdf");
                //File input = new File("results/FontExample5.pdf");
                document = PDDocument.load(input);
                if (document.isEncrypted()) {
                    try {
                        document.decrypt("");
                    } catch (InvalidPasswordException e) {
                        System.err.println("Error: Document is encrypted with a password.");
                        System.exit(1);
                    }
                }
                PrintTextLocations printer = new PrintTextLocations();
                List allPages = document.getDocumentCatalog().getAllPages();
                //for (int i = 0; i < allPages.size(); i++) {
                for (int i = 99; i < 100; i++) {
                    PDPage page = (PDPage) allPages.get(i);
                    System.out.println("Processing page: " + i);
                    PDStream contents = page.getContents();
                    if (contents != null) {
                        printer.processStream(page, page.findResources(), page.getContents().getStream());
                    }
                }
            } finally {
                if (document != null) {
                    document.close();
                }
            }
        }

        /**
         * Prints position, size and font style details for each piece of text.
         *
         * @param text The text to be processed
         */
        @Override // needed: the stripper calls this hook for every piece of text
        protected void processTextPosition(TextPosition text) {
            System.out.println("String[" + text.getXDirAdj() + ","
                    + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                    + text.getXScale() + " height=" + text.getHeightDir() + " space="
                    + text.getWidthOfSpace() + " width="
                    + text.getWidthDirAdj() + "]" + text.getCharacter());
            System.out.println(text.getCharacter() + " <-------------------------------- "
                    + text.getFont().getBaseFont());

            // The font descriptor may be null, so check it before reading style flags.
            org.apache.pdfbox.pdmodel.font.PDFontDescriptor descriptor =
                    text.getFont().getFontDescriptor();
            if (descriptor == null) {
                System.out.println(" (no font descriptor)");
                return;
            }
            System.out.println(" Italic=" + descriptor.isItalic());
            System.out.println(" FontWeight=" + descriptor.getFontWeight());
            System.out.println(" ItalicAngle=" + descriptor.getItalicAngle());
            System.out.println(" FixedPitch=" + descriptor.isFixedPitch());
        }

    }

Solution

"This program works OK for PDF files that I have created but I have to get bold and italic info for Stedman's Dictionary.pdf which appears to have a trick to hide this info."

Your program works OK for Stedman's Dictionary, too: the textual information on those dictionary-style pages in the PDF uses the same font for normal, bold, italic, etc. text. The styles are only present in the overlaid image, which is merely an image and not a source of information for text extraction.

In some detail:

Looking e.g. into the content stream of the 132nd document page (numbered 110, chosen randomly) shows for the following entry

[image: entry for Bal]

the following source:

    /F1 22 Tf
    BT
    1 0 0 1 61 2559 Tm
    (Bal'four's)Tj
    ET
    /F1 21.46 Tf
    BT
    1 0 0 1 210 2559 Tm
    (disease')Tj
    ET
    /F1 24.76 Tf
    BT
    1 0 0 1 327 2561 Tm
    ([George)Tj
    ET
    /F1 22.71 Tf
    BT
    1 0 0 1 444 2563 Tm
    (Williatn)Tj
    ET
    /F1 23.33 Tf
    BT
    1 0 0 1 565 2564 Tm
    (Balfour,)Tj
    ET
    /F1 24.76 Tf
    BT
    1 0 0 1 692 2566 Tm
    (English)Tj
    ET
    /F1 23 Tf
    BT
    1 0 0 1 94 2525 Tm
    (physician,)Tj
    ET
    /F1 24.09 Tf
    BT
    1 0 0 1 252 2526 Tm
    (1822-1903.])Tj
    ET
    /F1 25.93 Tf
    BT
    1 0 0 1 447 2530 Tm
    (Chloroma.)Tj
    ET

I.e. the same font (F1) is used for each word with no differing styles, merely in different sizes:

  • "Bal'four's" at 22
  • "disease'" at 21.46
  • "[George" at 24.76
  • "Williatn" at 22.71
  • "Balfour," at 23.33
  • "English" at 24.76
  • "physician," at 23
  • "1822-1903.]" at 24.09
  • "Chloroma." at 25.93
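
If you want to reproduce such a list yourself, the following is a minimal sketch in plain Java (no PDFBox) that pulls font name, size and string out of a decoded content stream. The class name ContentStreamSizes and the regular expression are my own illustration, and it assumes the simple shape shown above: a "/Name size Tf" operator followed by a "(string)Tj" operator, with no escaped parentheses, hex strings or TJ arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStreamSizes {

    /** One shown string together with the font and size selected before it. */
    public static final class Run {
        public final String font;
        public final double size;
        public final String text;

        Run(String font, double size, String text) {
            this.font = font;
            this.size = size;
            this.text = text;
        }
    }

    // Matches either a font selection "/Name size Tf" or a shown string "(text)Tj".
    private static final Pattern TOKEN = Pattern.compile(
            "/(\\S+)\\s+([0-9.]+)\\s+Tf|\\(([^)]*)\\)\\s*Tj");

    /** Collects every Tj string with the most recently selected font and size. */
    public static List<Run> parse(String stream) {
        List<Run> runs = new ArrayList<>();
        String font = null;
        double size = 0;
        Matcher m = TOKEN.matcher(stream);
        while (m.find()) {
            if (m.group(1) != null) {
                // Tf operator: remember the current font and size.
                font = m.group(1);
                size = Double.parseDouble(m.group(2));
            } else {
                // Tj operator: record the string under the current font state.
                runs.add(new Run(font, size, m.group(3)));
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String stream = "/F1 22 Tf\nBT\n1 0 0 1 61 2559 Tm\n(Bal'four's)Tj\nET\n"
                + "/F1 21.46 Tf\nBT\n1 0 0 1 210 2559 Tm\n(disease')Tj\nET\n"
                + "/F1 24.76 Tf\nBT\n1 0 0 1 327 2561 Tm\n([George)Tj\nET\n";
        for (Run r : parse(stream)) {
            System.out.println(r.text + " uses " + r.font + " at " + r.size);
        }
    }
}
```

Every run reports the same font name, F1, which is the whole point: the content stream has nothing besides the size to tell the words apart.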

(Coordinates are scaled by a factor of 0.23945 on the page at hand; PDFBox will, therefore, give you numbers scaled by that factor instead of the listed sizes.)
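
To make the scaling concrete, here is a small arithmetic sketch. The class name ScaledSizes is made up for illustration; the factor and the sizes are the ones quoted above, and the result is only the plain product of the two, which is what I would expect the reported sizes to be close to.

```java
import java.util.Locale;

public class ScaledSizes {

    // Scale factor in effect on this particular page (taken from the text above).
    public static final double PAGE_SCALE = 0.23945;

    /** Converts a size as written in the content stream to its scaled value. */
    public static double scaled(double contentStreamSize) {
        return contentStreamSize * PAGE_SCALE;
    }

    public static void main(String[] args) {
        double[] sizes = {22, 21.46, 24.76, 22.71, 23.33, 24.76, 23, 24.09, 25.93};
        for (double size : sizes) {
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            System.out.println(String.format(Locale.ROOT, "%.2f -> %.4f", size, scaled(size)));
        }
    }
}
```

So a size of 22 in the content stream corresponds to roughly 5.27 after scaling, and any threshold you derive from the listed sizes has to be scaled the same way before you compare it with PDFBox output.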

The reason you nonetheless see bold (Bal'four's disease') or italic (Balfour,) text on the page is that this textual information is "rendered" in text rendering mode 3, i.e. invisibly, and a scanned image is displayed on top of it. Thus, short of running style-aware OCR on that image, you do not have any reliable information on the style of the text.

That being said, those sizes, if one tries to see any correlation at all, seem smaller for bold text, the dividing line being somewhere between 22 and 22.5 (my impression after looking at three or four dictionary entries). Thus, you might try to derive boldness from small sizes. I wouldn't count on this being a sure thing, though: some bold text might be larger, some non-bold text smaller.
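
A minimal sketch of that guess, in plain Java: the class name BoldBySize and the cut-off of 22.25 (simply the midpoint of the 22 to 22.5 range mentioned above) are assumptions of mine, not something the PDF tells you. The sizes are the unscaled ones from the content stream.

```java
public class BoldBySize {

    // Assumed cut-off between bold and non-bold sizes; hypothetical, tune it per document.
    public static final double BOLD_THRESHOLD = 22.25;

    /** Guesses boldness from the unscaled font size; this is a heuristic, not a fact. */
    public static boolean looksBold(double contentStreamSize) {
        return contentStreamSize < BOLD_THRESHOLD;
    }

    public static void main(String[] args) {
        String[] words = {"Bal'four's", "disease'", "[George", "Williatn", "Balfour,",
                "English", "physician,", "1822-1903.]", "Chloroma."};
        double[] sizes = {22, 21.46, 24.76, 22.71, 23.33, 24.76, 23, 24.09, 25.93};
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i] + " looksBold=" + looksBold(sizes[i]));
        }
    }
}
```

For the entry above this flags only "Bal'four's" and "disease'", which matches what the scanned image shows, but with "Williatn" at 22.71 sitting so close to the cut-off you can see how easily it could go wrong on other entries. It also says nothing about italics.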

OTHER TIPS

Try this:

    protected void processTextPosition(TextPosition text) {
        boolean isBold = false;
        boolean isItalic = false;

        // The font descriptor may be missing, so check it before every use.
        if (text.getFont().getFontDescriptor() != null) {
            // Treat a set ForceBold flag or a font weight above 680 as bold.
            isBold = text.getFont().getFontDescriptor().isForceBold()
                    || text.getFont().getFontDescriptor().getFontWeight() > 680;
            isItalic = text.getFont().getFontDescriptor().isItalic();
        }

        // Print each piece of text once, together with the detected style.
        System.out.println(text.getCharacter()
                + " bold=" + isBold + " italic=" + isItalic);
    }
Licensed under: CC-BY-SA with attribution