Question

I have multiple PDFs and I want to extract the text from a certain region of each one's first page. Given that I have the coordinates of the bounding box for the text in the PDF, how do I extract that text from the command line?

I researched a bit and found that PDFMiner and PDFBox can do this. But PDFMiner is very poorly documented.

Can someone tell me how to do this using PDFMiner, or suggest some other solution?

PS: I am working in a Linux terminal.


Solution

pdftotext (take one of the latest, Poppler-based versions) does let you define a page region to extract text from.

Try this:

pdftotext    \
  -f 5       \
  -l 7       \
  -x 200     \
  -y 700     \
  -W 144     \
  -H 80      \
   input.pdf \
   output.txt

This selects pages 5 to 7 and, on each of them, a rectangle 144 points wide (72 points == 1 inch) and 80 points high. The rectangle's top-left corner is at x-coordinate 200 and y-coordinate 700, measured from the top-left corner of the page.
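Since the question is about the first page of several PDFs, the same options can be wrapped in a small loop. The following is only a sketch: the function name extract_first_page_region and the choice to write each result to a .txt file next to its input are assumptions for illustration, not part of pdftotext itself.

```shell
#!/bin/sh
# Sketch: extract one fixed region from page 1 of every PDF passed in.
# extract_first_page_region is a hypothetical helper name.
# Arguments: x y width height file.pdf [file.pdf ...]
extract_first_page_region() {
  x=$1; y=$2; w=$3; h=$4
  shift 4
  for pdf in "$@"; do
    # -f 1 -l 1 limits extraction to the first page;
    # the text is written to NAME.txt next to NAME.pdf
    pdftotext -f 1 -l 1 -x "$x" -y "$y" -W "$w" -H "$h" \
      "$pdf" "${pdf%.pdf}.txt"
  done
}

# Example call, left commented out so sourcing this file has no side effects:
# extract_first_page_region 200 700 144 80 *.pdf
```

To print the text to the terminal instead of writing files, pdftotext accepts - as the output file name.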

OTHER TIPS

You could use PDFBox and its PDFTextStripperByArea class. The snippet below uses the PDFBox 1.8 API: https://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripperByArea.html

PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
// Register the region first; extractRegions() only fills regions that already exist.
stripper.addRegion( "class1", rectangle );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
// Collect the text of the first page into the registered regions.
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rectangle );
System.out.println( "Text: " + stripper.getTextForRegion( "class1" ) );

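Here rectangle is an instance of the Rectangle class from the java.awt package, created before the snippet above runs. http://docs.oracle.com/javase/7/docs/api/java/awt/Rectangle.html

// x, y, width and height are int values describing the bounding box
Rectangle rectangle = new Rectangle( x, y, width, height );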
Licensed under: CC-BY-SA with attribution