PDFBox text extraction with bold/italic info does not work on some files

Question 1

This program works OK for PDF files that I have created but I have to get bold and italic info for Stedman's Dictionary.pdf which appears to have a trick to hide this info.

Your program works OK for Stedman's Dictionary, too: the text on those dictionary-style pages in the PDF uses the same font for normal, bold, and italic words alike. The styles are only present in the overlaid image, which is merely... an image and not a source of information for text extraction.

In some detail:

Looking e.g. into the content stream of the 132nd document page (numbered 110, chosen randomly) shows the following source for the entry "Bal'four's disease":

/F1 22 Tf
BT
1 0 0 1 61 2559 Tm
(Bal'four's)Tj
ET
/F1 21.46 Tf
BT
1 0 0 1 210 2559 Tm
(disease')Tj
ET
/F1 24.76 Tf
BT
1 0 0 1 327 2561 Tm
([George)Tj
ET
/F1 22.71 Tf
BT
1 0 0 1 444 2563 Tm
(Williatn)Tj
ET
/F1 23.33 Tf
BT
1 0 0 1 565 2564 Tm
(Balfour,)Tj
ET
/F1 24.76 Tf
BT
1 0 0 1 692 2566 Tm
(English)Tj
ET
/F1 23 Tf
BT
1 0 0 1 94 2525 Tm
(physician,)Tj
ET
/F1 24.09 Tf
BT
1 0 0 1 252 2526 Tm
(1822-1903.])Tj
ET
/F1 25.93 Tf
BT
1 0 0 1 447 2530 Tm
(Chloroma.)Tj
ET

I.e. the same font (F1) is used for each word with no differing styles, merely in different sizes:

"Bal'four's" at 22
"disease'" at 21.46
"[George" at 24.76
"Williatn" at 22.71
"Balfour," at 23.33
"English" at 24.76
"physician," at 23
"1822-1903.]" at 24.09
"Chloroma." at 25.93

(Coordinates are scaled by a factor of 0.23945 on the page at hand; PDFBox will, therefore, give you the listed sizes multiplied by that factor, e.g. about 5.27 instead of 22.)
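To make the listing above reproducible, here is a minimal sketch that pulls the font name, size, and string out of a content stream excerpt and applies the page scale factor. It uses plain Java only; the class `ContentStreamRuns`, its `Run` type, and the regular expression are made up for this illustration. It is a toy parser that only copes with the simple `Tf ... (text)Tj` layout quoted above (no escaped parentheses, no compressed streams), so for real files a proper PDF parser is the better choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentStreamRuns {
    /** One text run: font resource name, Tf size, and the shown string. */
    static final class Run {
        final String font;
        final double size;
        final String text;

        Run(String font, double size, String text) {
            this.font = font;
            this.size = size;
            this.text = text;
        }
    }

    // Matches "/F1 22 Tf ... (text)Tj" as laid out in the excerpt above.
    private static final Pattern RUN = Pattern.compile(
            "/(\\S+)\\s+([0-9.]+)\\s+Tf.*?\\((.*?)\\)\\s*Tj", Pattern.DOTALL);

    /** Returns the runs in the order they appear in the excerpt. */
    static List<Run> parse(String contentStream) {
        List<Run> runs = new ArrayList<>();
        Matcher m = RUN.matcher(contentStream);
        while (m.find()) {
            runs.add(new Run(m.group(1), Double.parseDouble(m.group(2)), m.group(3)));
        }
        return runs;
    }

    public static void main(String[] args) {
        String excerpt = "/F1 22 Tf\nBT\n1 0 0 1 61 2559 Tm\n(Bal'four's)Tj\nET\n"
                + "/F1 21.46 Tf\nBT\n1 0 0 1 210 2559 Tm\n(disease')Tj\nET\n"
                + "/F1 24.76 Tf\nBT\n1 0 0 1 327 2561 Tm\n([George)Tj\nET\n";
        double pageScale = 0.23945; // factor quoted above for this page
        for (Run r : parse(excerpt)) {
            System.out.printf("%s %s at %.2f (scaled: %.2f)%n",
                    r.font, r.text, r.size, r.size * pageScale);
        }
    }
}
```

Every run comes back with the same font name, which is the point: only the size varies from word to word.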

The reason why you see bold (Bal'four's disease') or italic (Balfour,) text is that this textual information is "rendered" in rendering mode 3, i.e. invisibly, and a scanned image is displayed on top of it. The styles you see come from that image alone. Thus, you do not have any reliable information on the style of the text (short of applying style-aware OCR to that image).

That being said, those sizes, if one tries to see any correlation at all, seem smaller for bold texts, the dividing line being somewhere between 22 and 22.5 (my impression having looked at three or four dictionary entries). Thus, you might try to derive boldness from small sizes. I wouldn't count on this being a sure thing, though: some bold text might be larger, some non-bold text smaller.
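If you want to try that size heuristic, a minimal sketch could look like the following. The class `BoldGuess` and the cut-off `BOLD_MAX_SIZE = 22.25` are assumptions made up for this illustration (the cut-off is simply the midpoint of the 22 to 22.5 range guessed above), and the sizes are the unscaled ones from the content stream. If you feed in sizes reported by PDFBox instead, the cut-off has to be multiplied by the page's scale factor as well.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BoldGuess {
    // Hypothetical cut-off, midway between the 22 and 22.5 mentioned above.
    static final double BOLD_MAX_SIZE = 22.25;

    /** Guesses boldness from the unscaled Tf size; small sizes count as bold. */
    static boolean looksBold(double fontSize) {
        return fontSize < BOLD_MAX_SIZE;
    }

    public static void main(String[] args) {
        // Words and sizes taken from the dictionary entry listed above.
        Map<String, Double> words = new LinkedHashMap<>();
        words.put("Bal'four's", 22.0);
        words.put("disease'", 21.46);
        words.put("[George", 24.76);
        words.put("Williatn", 22.71);
        words.put("Balfour,", 23.33);
        words.put("English", 24.76);
        words.put("physician,", 23.0);
        words.put("1822-1903.]", 24.09);
        words.put("Chloroma.", 25.93);

        for (Map.Entry<String, Double> e : words.entrySet()) {
            String guess = looksBold(e.getValue()) ? "bold?" : "regular?";
            System.out.println(e.getKey() + " -> " + guess);
        }
    }
}
```

For this one entry the guess happens to single out "Bal'four's" and "disease'", but that is a sample of one; treat the result as a hint, not as style information.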

Question 2

Try this:

// Goes into a PDFTextStripper subclass; "out" is assumed to be a
// PrintStream or PrintWriter field of that class.
// Needed imports (PDFBox 2.x; in 1.8.x TextPosition lives in org.apache.pdfbox.util):
// import org.apache.pdfbox.pdmodel.font.PDFontDescriptor;
// import org.apache.pdfbox.text.TextPosition;

@Override
protected void processTextPosition(TextPosition text) {
    boolean isBold = false;
    boolean isItalic = false;

    // Not every font carries a descriptor, so check before using it.
    PDFontDescriptor descriptor = text.getFont().getFontDescriptor();
    if (descriptor != null) {
        isBold = descriptor.isForceBold() || descriptor.getFontWeight() > 680;
        // Not used for the output below; kept for callers that also need italics.
        isItalic = descriptor.isItalic();
    }

    String s = text.toString();
    if (s == null) {
        s = "";
    }
    boolean isBlank = s.trim().isEmpty();

    // Print bold text, plus whitespace so that the printed words stay separated.
    if (isBold || isBlank) {
        out.print(s);
    }
}