Font information of text in PDF using PDFBox

Question 1

The PDFTextStripper class you use is documented (cf. its JavaDoc comment) like this:

* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.

To get specific font information, therefore, you have to change it somewhat.

The font information is available in this class all along and is only discarded when a line is output. Have a look at its source:

protected void writePage() throws IOException
{
    [...]
    for( int i = 0; i < charactersByArticle.size(); i++)
    {
        [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
        while( textIter.hasNext() )
        {
            [...]
            if( lastPosition != null )
            {
                [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    line.clear();
                    [...]
                }
                [...]

The TextPosition instances in the list line still carry all formatting information, including the font used. Only when line is "normalized" is it reduced to pure characters.

To keep font information, therefore, you have different options, depending on how you want to retrieve the font information:

If you want to continue retrieving all page content information (including fonts) in a single String via getText: You change the method
```
private List<String> normalize(List<TextPosition> line, boolean isRtlDominant, boolean hasRtl)
```
to include some font tags (e.g. [Arial]) of your choice whenever the font changes. Unfortunately this method is private. Thus, you have to copy the whole PDFTextStripper class and change the code of the copy.
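The tagging logic itself is small. Below is a minimal, self-contained sketch of it, not the PDFBox code: Glyph is a hypothetical stand-in for TextPosition that only carries a character and a font name, and tagLine is a made-up helper name. In a modified copy of the stripper, the font would instead come from TextPosition.getFont().

```java
import java.util.Arrays;
import java.util.List;

class FontTagSketch {
    /** Hypothetical stand-in for PDFBox's TextPosition: one glyph plus its font name. */
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Joins the glyphs of one line, emitting "[fontName]" whenever the font changes. */
    static String tagLine(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        String currentFont = null; // no font seen yet on this line
        for (Glyph g : line) {
            if (!g.fontName.equals(currentFont)) {
                currentFont = g.fontName;
                out.append('[').append(currentFont).append(']');
            }
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("H", "Arial"), new Glyph("i", "Arial"),
                new Glyph(" ", "Arial"), new Glyph("X", "Courier"));
        System.out.println(tagLine(line)); // prints [Arial]Hi [Courier]X
    }
}
```

Because currentFont starts as null for each call, every line begins with a tag. If you would rather tag only real font changes across lines, keep the current font in a field of the stripper instead.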

If you want to retrieve the specific font information in a different structure (e.g. as List<List<TextPosition>>), you can derive your own stripper class from PDFTextStripper, add some variable of your desired type, and override the protected method writePage mentioned above, copying it and only enhancing it right before or after the line

writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);

with code adding the information to your new variable. E.g.

public class MyPDFTextStripper extends PDFTextStripper
{
    public List<List<TextPosition>> myLines = new ArrayList<List<TextPosition>>();
    [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    myLines.add(new ArrayList<TextPosition>(line));
                    line.clear();
                    [...]
                }

Now you can call getText on an instance of your MyPDFTextStripper, retrieve the plain text as the result, and access the additional data via the new variable myLines.
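Once the per-line positions are collected, evaluating them is ordinary list processing. The following is only a sketch of one possible evaluation (which fonts occur on which line), written generically so that it runs without PDFBox: Pos is a hypothetical stand-in for TextPosition, and fontsPerLine and fontNameOf are made-up names. With the real myLines you would pass a function that reads the font name from TextPosition.getFont().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class FontsPerLineSketch {
    /** Hypothetical stand-in for TextPosition: a character and the name of its font. */
    static final class Pos {
        final String text;
        final String fontName;

        Pos(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** For each line, collects the distinct font names in order of first appearance. */
    static <T> List<Set<String>> fontsPerLine(List<List<T>> lines, Function<T, String> fontNameOf) {
        List<Set<String>> result = new ArrayList<>();
        for (List<T> line : lines) {
            Set<String> fonts = new LinkedHashSet<>(); // keeps insertion order
            for (T position : line) {
                fonts.add(fontNameOf.apply(position));
            }
            result.add(fonts);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Pos>> lines = Arrays.asList(
                Arrays.asList(new Pos("A", "Arial"), new Pos("b", "Arial-Bold")),
                Arrays.asList(new Pos("c", "Courier")));
        System.out.println(fontsPerLine(lines, p -> p.fontName));
        // prints [[Arial, Arial-Bold], [Courier]]
    }
}
```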

Question 2

To use fonts other than those that come with the library, you need to add the font files explicitly.