This ...
How the PDf viewer know that CANADA is in the second column not in the third.
is the wrong sort of question -- but the "why" contains hints for a possible solution.
The question is "wrong" because your "PDF viewer" does not know text should be in the second column. There "is" no spoon column in a PDF: all that the viewer gets is a list of (x,y) positions and text to display it on. All it has to do is move a cursor to that (x,y) position and draw the text. See? No columns involved. Not a single [Tab] character either (or any other kind of magic \specialChar
, for that matter).
A dumb, straightforward to-text converter scans the input file for text runs and writes them out immediately. It may test for x-positions that are larger than expected, and insert a space when necessary -- in fact, it seems iText does this because inspecting your file shows there is no 'space' character stored between "1" and "WJC:PLAYOFFS CANADA". There is a move to a larger x position on the same y position, so iText infers there is 'something'.
A possible solution is to store all (x,y) coordinates of all text fragments, sort them, and then test whether the end of each text fragment is within a reasonable distance of the start of the next one. (This requires you to retrieve the character widths as well.) If the distance is more or less equal to a space width, you can output a 'space'. If it's more, you can output a [Tab]. The following is the output of a simple PDF reader that does exactly this:
1 WJC:PLAYOFFS CANADA TSN+ M.W.... 19:30 21:57 5133
2 WJC:PLYOFF CAN PSTGM TSN+ ..W.... 21:54 22:21 3558
3 BIG BANG THEORY CTV Total ...T... 20:00 20:31 3334
-- I aligned the columns manually for clarity, as there was only a single [Tab] between each column. Your document is 'easy', in that every column contains some text. It's ever so slightly harder if it does not (but if necessary, you could create a list of likely tab positions, and test each new text string against that).
In short, you cannot use the plain function getTextFromPage
, you need to retrieve correct x and y positions and process them.
Surprising: for some unknown reason the line
20 LAW AND ORDER:SVU CTV Total W 21:00 23:00 1295
is included twice in this document on exactly the same position. I did not anticipate that, and so after sorting, I got this in my output:
20<FONT ArialMT>20 LALAWW ANANDD ORDEORDER:SR:SVUVU CTCTVV TTotalotal ..WW.... 21:0021:00 23:0023:00 1295<FONT Arial-BoldMT>1295
A simpler solution
... would be to manually create a list of "Broadcast Outlets". The list has a fairly predictable format: [digits] [Title] [Outlet] ..
(etc.), and only Title and Outlet do not follow a specific pattern. In this list I count just 4 different broadcasters. Parsing the remaining 'columns' should be straightforward.