Question

I'm trying to extract text from PDF documents. I've tested several tools, such as PDFBox, TET and PDFTextStream, but none of them handles text extraction from Persian multi-column PDF documents well.

Currently I'm trying to combine the good features of these tools and apply some tricks on top of them. I want to know how I can detect the number of columns on a page and how to split the text of those columns.

Specifically, I want to know which class of PDFBox or PDFTextStream is responsible for column detection and how it works.


Solution

I can only speak for PDFTextStream, but to understand how it works you first need to understand, roughly, how PDFTextStream looks at a PDF document.

Each document is made up of Pages, which are made up of Blocks (of which there can be many, possibly nested). Blocks ultimately contain Lines, which contain TextUnits.

Each of these units has x, y, width and height properties. A PDF is nothing more than these basic units laid out according to their coordinates. When you ask PDFTextStream to "read" a page or a region, it looks at the objects and how they are laid out on the X, Y plane and uses an approximation of how that would translate to text. This is why you get errors: there is no 100% foolproof way to turn this structure into machine-readable, structured data.
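To make the idea concrete, here is a minimal sketch of one common heuristic for finding columns from coordinates alone: project every box onto the X axis, merge boxes whose horizontal extents overlap or nearly touch, and treat each merged extent as a column. This is not PDFTextStream's (proprietary) algorithm or PDFBox's; `ColumnSketch`, `Box`, `Column`, `detectColumns`, `readColumns` and the `minGap` threshold are all hypothetical names invented for illustration. It also assumes y is measured from the top of the page, and it lets you read columns right to left, which is what Persian text needs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ColumnSketch {

    /** Hypothetical positioned text unit; not a PDFBox or PDFTextStream class. */
    static final class Box {
        final double x, y, width, height;
        final String text;

        Box(double x, double y, double width, double height, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
            this.text = text;
        }

        double right() {
            return x + width;
        }
    }

    /** Horizontal extent of one detected column plus the boxes that fell into it. */
    static final class Column {
        final double left;
        double right;
        final List<Box> boxes = new ArrayList<>();

        Column(double left, double right) {
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Sweeps the boxes from left to right and starts a new column whenever the
     * horizontal gap to the current column exceeds minGap (an assumed tuning knob).
     */
    static List<Column> detectColumns(List<Box> boxes, double minGap) {
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble((Box b) -> b.x));

        List<Column> columns = new ArrayList<>();
        Column current = null;
        for (Box b : sorted) {
            if (current == null || b.x - current.right > minGap) {
                current = new Column(b.x, b.right());
                columns.add(current);
            } else {
                current.right = Math.max(current.right, b.right());
            }
            current.boxes.add(b);
        }
        return columns;
    }

    /**
     * Emits the text column by column, top to bottom within each column.
     * Assumes y grows downward; pass rightToLeft = true for Persian layouts.
     */
    static String readColumns(List<Column> columns, boolean rightToLeft) {
        List<Column> ordered = new ArrayList<>(columns);
        Comparator<Column> byLeft = Comparator.comparingDouble((Column c) -> c.left);
        ordered.sort(rightToLeft ? byLeft.reversed() : byLeft);

        StringBuilder out = new StringBuilder();
        for (Column c : ordered) {
            List<Box> lines = new ArrayList<>(c.boxes);
            lines.sort(Comparator.comparingDouble((Box b) -> b.y));
            for (Box b : lines) {
                out.append(b.text).append('\n');
            }
        }
        return out.toString();
    }
}
```

A caveat with this kind of projection: any box that spans the full page width (a title or a footer, say) bridges the gap and collapses everything into one column, so such boxes would have to be filtered out or handled per vertical band before calling `detectColumns`. That is one reason real extractors only approximate the reading order.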

In PDFTextStream, you should look at the getRegionText function and example. PDFTextStream is proprietary (the reason why I'm moving to PDFBox), so I can't give you details about the algorithms used to fetch the text, but they're based on the above oversimplification.

Good luck.

Licensed under: CC-BY-SA with attribution