Question

I want to parse this file (http://www.bbm.ca/_documents/top_30_tv_programs_english/2011/nat01032011.pdf) with iText. The problem is it is not tagged so I can't get the XML file. I decided to extract the text from it and I thought that for example the first line will be like :

1\specialCharWJC:PLAYOFFS CANADA\specialCharTSN+\specialCharM.W....\specialChar19:30\specialChar21:57\specialChar5133

The text I extracted for the first line is

1 WJC:PLAYOFFS CANADA TSN+ M.W.... 19:30 21:57 5133

I extracted the text using :

PdfReader reader = new PdfReader(filename);
String str = PdfTextExtractor.getTextFromPage(reader, 1);

How the PDf viewer know that CANADA is in the second column not in the third.

My current solution is to convert the pdf file to html5 using http://www.idrsolutions.com/online-pdf-to-html5-converter/ who can determine the text for each column.

Thanks for your response

Was it helpful?

Solution 2

This ...

How the PDf viewer know that CANADA is in the second column not in the third.

is the wrong sort of question -- but the "why" contains hints for a possible solution.

The question is "wrong" because your "PDF viewer" does not know text should be in the second column. There "is" no spoon column in a PDF: all that the viewer gets is a list of (x,y) positions and text to display it on. All it has to do is move a cursor to that (x,y) position and draw the text. See? No columns involved. Not a single [Tab] character either (or any other kind of magic \specialChar, for that matter).

A dumb, straightforward to-text converter scans the input file for text runs and writes them out immediately. It may test for x-positions that are larger than expected, and insert a space when necessary -- in fact, it seems iText does this because inspecting your file shows there is no 'space' character stored between "1" and "WJC:PLAYOFFS CANADA". There is a move to a larger x position on the same y position, so iText infers there is 'something'.

A possible solution is to store all (x,y) coordinates of all text fragments, sort them, and then test whether the end of each text fragment is within a reasonable distance of the start of the next one. (This requires you to retrieve the character widths as well.) If the distance is more or less equal to a space width, you can output a 'space'. If it's more, you can output a [Tab]. The following is the output of a simple PDF reader that does exactly this:

1   WJC:PLAYOFFS CANADA     TSN+        M.W.... 19:30   21:57   5133
2   WJC:PLYOFF CAN PSTGM    TSN+        ..W.... 21:54   22:21   3558
3   BIG BANG THEORY         CTV Total   ...T... 20:00   20:31   3334

-- I aligned the columns manually for clarity, as there was only a single [Tab] between each column. Your document is 'easy', in that every column contains some text. It's ever so slightly harder if it does not (but if necessary, you could create a list of likely tab positions, and test each new text string against that).

In short, you cannot use the plain function getTextFromPage, you need to retrieve correct x and y positions and process them.


Surprising: for some unknown reason the line

20  LAW AND ORDER:SVU   CTV Total   W   21:00   23:00   1295

is included twice in this document on exactly the same position. I did not anticipate that, and so after sorting, I got this in my output:

20<FONT ArialMT>20 LALAWW ANANDD ORDEORDER:SR:SVUVU CTCTVV TTotalotal ..WW.... 21:0021:00 23:0023:00 1295<FONT Arial-BoldMT>1295

A simpler solution

... would be to manually create a list of "Broadcast Outlets". The list has a fairly predictable format: [digits] [Title] [Outlet] .. (etc.), and only Title and Outlet do not follow a specific pattern. In this list I count just 4 different broadcasters. Parsing the remaining 'columns' should be straightforward.

OTHER TIPS

I wrote the iText text extractor. There are two extraction strategies in iText - one is naive (more proof of concept) that just dumps text as it hits it. The other (LocationTextExtractionStrategy) is much more refined with how it builds strings using the location and font informaton at @Jongware suggests (it also takes all coordinate transformations into account). The latter is the default strategy if you just call getTextFromPage() like you are.

The reason that the row 20 text displays twice is b/c some PDF producers do that to emulate a bold glyph (they shift the characters a tad and re-render). So that is not a bug, really - but certainly could be an opportunity for improvement. There may be something that we could do if we detect chunks of identical content that land within a certain twips zone of each other. The reason we haven't done this already is that this can be REALLY tricky, b/c you might have one chunk that is the entire word, and another set of chunks - one for each letter. We have the ability to do sub-chunk analysis (and in fact this is exposed in the parser interface somewhere - can't recall off hand - let me know if you need it and I'll track it down) - but that would come with a pretty hefty performance penalty, so I'm loathe to do it.

Anyway, the way that I would solve this specific challenge would be to set up physical zones and pass a region filter into the LocationTextExtractionStrategy#getResultantText() call.

If you truly need to insert tab characters (or some column marker) based on the horizontal position of the text, this is quite doable - take a look at where the isChunkAtWordBoundary() method is called in the LocationTextExtractionStrategy source code and add your own handler for inserting special characters beyond a space. It would also be possible to do some sort of contextual analysis (i.e. notice that there are a bunch of chunks that happen to share the same X position and orientation, and designate that X position as a tab stop).

If you come up with an idea that is nice and generic (i.e. not specific to this one parsing task), let me know and I'll see what I can do to incorporate it into iText.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top