Question

I am new to the Apache PDFBox library.

I want to map font information to the paragraphs of a PDF.

I have already gone through the question "How to extract font styles of text contents using pdfbox?"

But it doesn't give information about which paragraph is written in which font.

For example, if my page contains the text:

para1:Arial

para2:Times New Roman

Then I should be able to get the information that para1 is written in Arial while para2 is written in Times New Roman.

The solution proposed in the above question only tells me that the PDF page contains Arial and Times New Roman.


Solution

The PDFTextStripper class you use is documented (cf. its JavaDoc comment) like this:

* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.

To get specific font information, therefore, you have to change it somewhat.

The font information is available in this class all along and is only discarded when a line is output. Have a look at its source:

protected void writePage() throws IOException
{
    [...]
    for( int i = 0; i < charactersByArticle.size(); i++)
    {
        [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
        while( textIter.hasNext() )
        {
            [...]
            if( lastPosition != null )
            {
                [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    line.clear();
                    [...]
                }
............

The TextPosition instances in that list line still carry all formatting information, including the font used. Only when line is "normalized" is it reduced to plain characters.

To keep font information, therefore, you have different options, depending on how you want to retrieve the font information:

  • If you want to continue retrieving all page content information (including fonts) in a single String via getText, change the method

    private List<String> normalize(List<TextPosition> line, boolean isRtlDominant, boolean hasRtl)
    

    to include font tags of your choice (e.g. [Arial]) whenever the font changes. Unfortunately this method is private, so you have to copy the whole PDFTextStripper class and change the code of the copy.

  • If you want to retrieve the specific font information in a different structure (e.g. as List<List<TextPosition>>), you can derive your own stripper class from PDFTextStripper, add a variable of your desired type, and override the protected method writePage mentioned above, copying it and only enhancing it right before or after the line

    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
    

    with code adding the information to your new variable. E.g.

    public class MyPDFTextStripper extends PDFTextStripper
    {
        public List<List<TextPosition>> myLines = new ArrayList<List<TextPosition>>();
        [...]
                    if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                    {
                        writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                        myLines.add(new ArrayList<TextPosition>(line));
                        line.clear();
                        [...]
                    }
    

    Now you can call getText on an instance of your MyPDFTextStripper, retrieve the plain text as the result, and access the additional data via the new variable.
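Both options come down to walking the positions of a line and noticing where the font changes. The following is only a minimal sketch of that logic, not PDFBox code: the class Glyph is a hypothetical stand-in for TextPosition (in real code the font would come from TextPosition.getFont()), and the method names tagFontChanges and fontsPerLine are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class FontRunsSketch {

    // Hypothetical stand-in for PDFBox's TextPosition: a piece of text plus its font name.
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    // First option: build the line's text, inserting a [FontName] tag
    // whenever the font differs from that of the previous position.
    static String tagFontChanges(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        String current = null;
        for (Glyph g : line) {
            if (!g.fontName.equals(current)) {
                current = g.fontName;
                out.append('[').append(current).append(']');
            }
            out.append(g.text);
        }
        return out.toString();
    }

    // Second option: given the collected lines (like myLines above),
    // list the distinct fonts of each line in order of first appearance.
    static List<Set<String>> fontsPerLine(List<List<Glyph>> lines) {
        List<Set<String>> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            Set<String> fonts = new LinkedHashSet<>();
            for (Glyph g : line) {
                fonts.add(g.fontName);
            }
            result.add(fonts);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> para1 = Arrays.asList(new Glyph("para", "Arial"), new Glyph("1", "Arial"));
        List<Glyph> para2 = Arrays.asList(new Glyph("para", "TimesNewRoman"), new Glyph("2", "TimesNewRoman"));
        System.out.println(tagFontChanges(para1));
        System.out.println(fontsPerLine(Arrays.asList(para1, para2)));
    }
}
```

Lines collected this way still have to be grouped into paragraphs by whatever rule fits your documents; the sketch only shows how the font of each line can be read once the positions are kept.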

OTHER TIPS

To use fonts other than those that ship with the library, you need to add the font files explicitly.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow