PDF Data Extraction - Need Suggestions

https://stackoverflow.com/questions/5338062

26-10-2019

Question

I created a PDF extraction tool (sample screen attached). The user loads a PDF file and selects the data area they want. I then record the PDF coordinates and page number and save them as a template. Once the user supplies a list of PDF files, the tool extracts data from each of them according to the template file. My tool is very similar to this.

The problem is that in some PDFs the data I need to extract is shifted down the page or onto the next page. For example, think of a bill listing the items you purchased: where "Total Value" is printed depends on the number of items you bought. With a long list the total goes near the bottom; otherwise it is in the middle or near the top.

So I am now thinking about identifying the structure of the PDF instead of relying on coordinates.

But I don't have a clear idea of how to do that. Please share anything you think might help solve this problem. To repeat, I am trying to grab data from a PDF, so my question is whether it is possible to capture the structure of a PDF file.

My idea is that if I can identify the structure, I can say where the value is. For example, I tried converting the PDF to HTML and navigating through the HTML tags (body -> div -> table -> td, etc.), but it wasn't successful. :(

Solution

PDF has only weak structure, nothing like divs or containers. There are layer groups and similar constructs, but coordinates are the only thing you can count on.

Try describing the data by its type of text and its margins from the left and right, so that the capture is independent of the page.
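As a rough illustration of that idea, here is a minimal Python sketch. It assumes you already have the words of the document with their bounding boxes from whatever extraction library you use; the `Word` class, the `find_in_band` helper, the `AMOUNT` pattern and the sample coordinates are all made up for this example. A value is described by a left/right margin band plus a text pattern, and the page number and vertical position are ignored.

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    page: int    # 0-based page index
    x0: float    # left edge of the word
    x1: float    # right edge of the word
    top: float   # distance from the top of the page
    text: str

# Hypothetical "type of text": a money amount such as 19.99 or 1,234.50
AMOUNT = re.compile(r"^\d{1,3}(,\d{3})*\.\d{2}$")

def find_in_band(words, left, right, pattern):
    """Return words that match `pattern` and lie between the left and
    right margins, on any page, sorted by page and vertical position."""
    hits = [
        w for w in words
        if w.x0 >= left and w.x1 <= right and pattern.match(w.text)
    ]
    return sorted(hits, key=lambda w: (w.page, w.top))

# Made-up sample: a bill whose total has moved to the second page
words = [
    Word(0, 50, 120, 100, "Widget"),
    Word(0, 480, 530, 100, "19.99"),
    Word(1, 50, 90, 60, "Total"),
    Word(1, 470, 530, 60, "1,234.50"),
]

hits = find_in_band(words, left=450, right=540, pattern=AMOUNT)
total = hits[-1]  # assumption: the total is the last amount in that column
print(total.page, total.text)
```

Because the template stores only the margins and the pattern, it keeps working when the total moves down the page or onto the next one.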

OTHER TIPS

The PDF file format includes an optional set of structure tags (so-called Tagged PDF). If these are used, the file will have some structure. Otherwise you are out of luck. I wrote a blog post explaining how to find this out at http://www.jpedal.org/PDFblog/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
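For a quick first check without any PDF library, a crude heuristic is to look in the raw bytes for the names a tagged file normally carries in its catalog: `/StructTreeRoot` and `/MarkInfo` with `/Marked true`. The sketch below is only that, a heuristic. The function name `tagging_hints` is made up, and the check can miss tags when the catalog is stored inside a compressed object stream, so a real PDF parser is more reliable.

```python
import re

def tagging_hints(pdf_bytes: bytes) -> dict:
    """Crude scan of raw PDF bytes for signs of Tagged PDF.

    Heuristic only: it does not parse the file, and it will not see
    entries that live inside compressed object streams.
    """
    return {
        "struct_tree_root": b"/StructTreeRoot" in pdf_bytes,
        "marked": re.search(rb"/Marked\s+true", pdf_bytes) is not None,
    }

# Made-up fragments standing in for the contents of real files
tagged = b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
untagged = b"<< /Type /Catalog /Pages 2 0 R >>"

print(tagging_hints(tagged))
print(tagging_hints(untagged))
```

For a real file you would pass in the bytes read from disk, for example `open(path, "rb").read()`.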

You can use an "anchor", such as "ORDER QTY", and then capture data relative to it. Take a look at www.ivytools.net: in that tool you can define rules that specify how to find values relative to other text in the document. In your example it would be something like:

p.Find("ORDER QTY").Down()
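If you want to try the same anchor idea in your own tool, here is a minimal Python sketch. It is not the ivytools API. The `Word` class and the `find_anchor` and `down` helpers are hypothetical names, and the word boxes are made-up sample data that you would normally get from your PDF library. It also assumes that words on the same line share (almost) the same `top` value.

```python
from dataclasses import dataclass

@dataclass
class Word:
    page: int    # 0-based page index
    x0: float    # left edge of the word
    x1: float    # right edge of the word
    top: float   # distance from the top of the page
    text: str

def find_anchor(words, label):
    """Return the words of the first occurrence of `label` on one line,
    or None if the label is not found."""
    parts = label.split()
    ordered = sorted(words, key=lambda w: (w.page, round(w.top), w.x0))
    for i in range(len(ordered) - len(parts) + 1):
        window = ordered[i:i + len(parts)]
        same_line = all(
            w.page == window[0].page and abs(w.top - window[0].top) < 2
            for w in window
        )
        if same_line and [w.text for w in window] == parts:
            return window
    return None

def down(words, anchor):
    """Return the nearest word below the anchor, on the same page, that
    overlaps it horizontally, or None."""
    x0 = min(w.x0 for w in anchor)
    x1 = max(w.x1 for w in anchor)
    page, top = anchor[0].page, anchor[0].top
    below = [
        w for w in words
        if w.page == page and w.top > top and w.x0 < x1 and w.x1 > x0
    ]
    return min(below, key=lambda w: w.top, default=None)

# Made-up sample: a table header with two rows beneath it
words = [
    Word(0, 50, 100, 200, "ITEM"),
    Word(0, 300, 340, 200, "ORDER"),
    Word(0, 345, 370, 200, "QTY"),
    Word(0, 50, 110, 220, "Widget"),
    Word(0, 320, 335, 220, "12"),
    Word(0, 320, 335, 240, "7"),
]

anchor = find_anchor(words, "ORDER QTY")
value = down(words, anchor)
print(value.text)
```

The anchor is found by its text, not by its position, so the rule still works when the whole block moves up or down or lands on another page. This sketch only looks below the anchor on the same page; if the value can wrap onto the following page, `down` would need to be extended.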

Licensed under: CC-BY-SA with attribution
