Extract all text with string positions from a PDF

https://stackoverflow.com/questions/9975036

28-05-2021

Question

This may seem an old question, but I didn't find an exhaustive answer after spending half an hour searching all over SO.

I am using PDFBox and would like to extract all of the text from a PDF file along with the coordinates of each string. I am using their PrintTextLocations example (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/PrintTextLocations.html), but with the kind of PDF I am working with (e-tickets) the program fails to recognize strings and prints each character separately. The output is a list of lines, each representing one TextPosition object, like this:

String[414.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.0] s
String[418.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] a
String[423.38696,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=1.776001] l
String[425.16296,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] e

Instead, I would like the program to recognize the string "sale" as a single TextPosition and give me its position. I also tried the PDFTextStripper methods setSpacingTolerance() and setAverageCharTolerance(), setting values above and below the defaults (which FYI are 0.5 and 0.3 respectively), but the output didn't change at all. Where am I going wrong? Thanks in advance.

Solution

As Joey mentioned, PDF is just a collection of instructions telling you where a certain character should be printed.

To extract words or lines, you will have to perform some data segmentation yourself: by studying the bounding boxes of the characters, you can work out which characters sit on the same line and then which of those form words.
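The segmentation described above can be sketched in plain Java, without PDFBox, so that it runs on its own. This is only an illustration under stated assumptions: the Glyph and Word classes, the groupIntoWords method and both tolerance values are hypothetical names and numbers, not part of the PDFBox API. The first four glyphs reuse the x, y and width values from the question's output; the last two are made up to show a word break.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphGrouper {

    /** One positioned character: the x, y and width values printed per character. */
    static final class Glyph {
        final float x, y, width;
        final String text;

        Glyph(float x, float y, float width, String text) {
            this.x = x; this.y = y; this.width = width; this.text = text;
        }
    }

    /** A merged run of glyphs; x and y are those of its first glyph. */
    static final class Word {
        final float x, y;
        final String text;

        Word(float x, float y, String text) {
            this.x = x; this.y = y; this.text = text;
        }
    }

    /** Groups glyphs into lines by y, then into words by horizontal gap (both tolerances are assumptions). */
    static List<Word> groupIntoWords(List<Glyph> glyphs, float lineTolerance, float gapTolerance) {
        // Step 1: bucket glyphs into lines by comparing y values.
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y));
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : byY) {
            if (g.text.trim().isEmpty()) continue; // whitespace glyphs only separate words
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last == null || Math.abs(g.y - last.get(0).y) > lineTolerance) {
                last = new ArrayList<>();
                lines.add(last);
            }
            last.add(g);
        }
        // Step 2: within each line, sort by x and split where the gap is too wide.
        List<Word> words = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder text = new StringBuilder();
            Glyph first = null, prev = null;
            for (Glyph g : line) {
                if (prev != null && g.x - (prev.x + prev.width) > gapTolerance) {
                    words.add(new Word(first.x, first.y, text.toString()));
                    text.setLength(0);
                    first = null;
                }
                if (first == null) first = g;
                text.append(g.text);
                prev = g;
            }
            words.add(new Word(first.x, first.y, text.toString()));
        }
        return words;
    }

    /** Sample data: "sale" from the question, plus two invented glyphs further right. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph(414.93896f, 637.2442f, 4.0f, "s"));
        glyphs.add(new Glyph(418.93896f, 637.2442f, 4.447998f, "a"));
        glyphs.add(new Glyph(423.38696f, 637.2442f, 1.776001f, "l"));
        glyphs.add(new Glyph(425.16296f, 637.2442f, 4.447998f, "e"));
        glyphs.add(new Glyph(440.0f, 637.2442f, 4.447998f, "o"));   // hypothetical
        glyphs.add(new Glyph(444.448f, 637.2442f, 4.0f, "k"));      // hypothetical
        return glyphs;
    }

    public static void main(String[] args) {
        // Tolerances of 1.0 are guesses; roughly half the "space" value in the question's output.
        for (Word w : groupIntoWords(sample(), 1.0f, 1.0f)) {
            System.out.println(w.text + " at " + w.x + "," + w.y);
        }
    }
}
```

With real PDFBox output you would fill the glyph list from each TextPosition (for example from getXDirAdj(), getYDirAdj(), getWidthDirAdj() and getUnicode()), and the tolerances would likely need tuning per document, perhaps relative to getWidthOfSpace().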

OTHER TIPS

Here is a solution:

1. Read the file.
2. Process each page with a PDFTextStripper subclass (PDFParserTextStripper below).
3. Print the position of each character of the text.

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
class PDFParserTextStripper extends PDFTextStripper {
    public PDFParserTextStripper(PDDocument pdd) throws IOException {
        super();
        document = pdd;
    }
    public void stripPage(int pageNr) throws IOException {
        this.setStartPage(pageNr + 1);
        this.setEndPage(pageNr + 1);
        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        writeText(document, dummy); // This call starts the parsing process and calls writeString repeatedly.
    }
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj() + " fs=" + text.getFontSizeInPt()
                    + " xscale=" + text.getXScale() + " height=" + text.getHeightDir() + " space="
                    + text.getWidthOfSpace() + " width=" + text.getWidthDirAdj() + " ] " + text.getUnicode());
        }
    }
    public static void extractText(InputStream inputStream) {
        PDDocument pdd = null;
        try {
            pdd = PDDocument.load(inputStream);
            PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
            stripper.setSortByPosition(true);
            for (int i = 0; i < pdd.getNumberOfPages(); i++) {
                stripper.stripPage(i);
            }
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it.
            e.printStackTrace();
        } finally {
            if (pdd != null) {
                try {
                    pdd.close();
                } catch (IOException e) {
                    // Nothing useful can be done if closing fails; ignore.
                }
            }
        }
    }
    public static void main(String[] args) throws IOException {
        // Placeholder path; point this at your own PDF.
        File f = new File("C:/PDFLOCATION/target.pdf");
        FileInputStream fis = null;
        try {
            fis = new FileInputStream(f);
            extractText(fis);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (fis != null)
                    fis.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}

Licensed under: CC-BY-SA with attribution
