Question

I have multiple PDFs and want to extract the text from a specific region of each one's first page. Given the coordinates of the bounding box for that text, how can I extract it from the command line?

I researched a bit and found that PDFMiner and PDFBox can do this. But PDFMiner is very poorly documented.

Can someone tell me how to do this using PDFMiner, or suggest some other solution?

PS: I am working in a Linux terminal.


Solution

pdftotext (use a recent, Poppler-based version) lets you define a page region to extract text from.

Try this:

pdftotext    \
  -f 5       \
  -l 7       \
  -x 200     \
  -y 700     \
  -W 144     \
  -H 80      \
   input.pdf \
   output.txt

This selects pages 5-7 and, on each of them, a rectangle 144 points wide (72 points = 1 inch) and 80 points high whose top-left corner is at x-coordinate 200 and y-coordinate 700.
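
Since the question is about the first page of several PDFs, the same options can be wrapped in a small shell function. This is only a sketch: the function name extract_first_page_region is made up for illustration, and it assumes a Poppler pdftotext on the PATH. It uses -f 1 -l 1 to restrict extraction to page 1 and passes "-" as the output file so the text goes to stdout.

```shell
#!/bin/sh
# Sketch: extract the same rectangle from page 1 of every PDF given.
# The function name is hypothetical; only the pdftotext options are real.
# Usage: extract_first_page_region X Y WIDTH HEIGHT file.pdf [more.pdf ...]
extract_first_page_region() {
  x=$1; y=$2; w=$3; h=$4
  shift 4
  for pdf in "$@"; do
    # Print a header so the output of each file can be told apart.
    printf '== %s ==\n' "$pdf"
    # -f 1 -l 1 limits extraction to the first page;
    # "-" as the output file writes the text to stdout.
    pdftotext -f 1 -l 1 -x "$x" -y "$y" -W "$w" -H "$h" "$pdf" -
  done
}

# Example call (the coordinates are the placeholder values from the command above):
# extract_first_page_region 200 700 144 80 *.pdf
```

Quoting "$pdf" keeps file names with spaces intact. Redirect the output of the call to a file if you want to keep the results.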

Other suggestions

You could use PDFBox and its PDFTextStripperByArea class: https://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripperByArea.html

// "document" is an already loaded PDDocument (PDFBox 1.x API)
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
// Register the region before extracting; otherwise no text is collected for it
stripper.addRegion( "class1", rectangle );
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rectangle );
System.out.println( "Text: " + stripper.getTextForRegion( "class1" ) );

Here rectangle is an object of the Rectangle class from the java.awt package: http://docs.oracle.com/javase/7/docs/api/java/awt/Rectangle.html

Rectangle rectangle = new Rectangle( x, y, width, height ); // all four arguments are ints
Licensed under: CC-BY-SA with attribution