Question

Let's say we have a PDF file with a clickable contents page (I am talking about chapters and subchapters). How can such a file be parsed in C#, and how can an application tell whether the PDF it is reading has chapters, a table of contents, and so on?

This is a link to a PDF without a clickable table of contents: https://docs.google.com/open?id=0B1EbI-EMJxmkODE1Mm5WbFpEdXc. I could not find a PDF with a clickable table of contents, but I found a guide on how to create one here: http://everythingyoumightneed.blogspot.com/2013/01/how-to-create-pdf-with-clickable-links.html

So my question is: How can an app differentiate which is which and how can the one with clickable links be parsed?

Solution

Your problem is not dissimilar to trying to figure out where paragraphs and columns are in PDF files; PDF doesn't typically label a table of contents page as such. So even with a PDF library (such as iTextSharp pointed out by mkl), this will not be a trivial task.

With such a library, you will be able to see the pages in the PDF file and the text on the pages. However, if this is a book for example, the table of contents page may be the first, second, third or xth page in the PDF file because of various other pages appearing in front of it (cover, second cover, copyright, tributes, you name it...).

So an algorithm to discover whether there is a table of contents would have to find it somewhere in the first x pages of the PDF file. As there are no standard tags marking the text of a table of contents, this would have to be done by analysing the format of the text on each of those pages.
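As a rough illustration of that kind of format analysis, here is a minimal sketch. It assumes the text of a page has already been extracted as a string by whatever PDF library you use, and it guesses that a contents entry looks like a title followed by dot leaders (or a wide gap) and a page number. The class name, the regular expression and the threshold of three entries are all hypothetical choices for this example, not something PDF itself defines.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TocTextHeuristic
{
    // Assumption: a contents entry is "title", then two or more dots
    // or two or more whitespace characters, then a page number.
    private static readonly Regex EntryPattern =
        new Regex(@"^\S.*?(?:\.{2,}|\s{2,})\s*\d+$");

    // Returns true if at least minEntries lines of the page text
    // look like contents entries.
    public static bool LooksLikeTocPage(string pageText, int minEntries = 3)
    {
        int matches = pageText
            .Split('\n')
            .Select(line => line.Trim())
            .Count(line => EntryPattern.IsMatch(line));
        return matches >= minEntries;
    }
}
```

Real documents vary a lot (multi-column layouts, roman page numbers, entries wrapped over two lines), so a pattern like this would need tuning against the files you actually have to handle.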

There are two things that could be of help (if they are available):

1) In many PDF files the items in a table of contents are, as you say, clickable. So you could look through the PDF file and try to find the first page that contains a lot of hyperlinked items.

2) In many PDF files the table of contents is mirrored in the bookmarks. So you could also examine the bookmark structure and see whether you can use it to figure out how many chapters there are in the book.

Keep in mind that both of these features are optional, and their use is not standardized even when they are present.
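The two hints above could be combined along the following lines. This is only a sketch of the decision logic: it assumes you have already used a PDF library (iTextSharp, for instance) to count the link annotations on each page and the top-level bookmarks, and it leaves that extraction step out. The names TocLinkHeuristic, FindLinkedTocPage and HasNavigableContents, as well as the default thresholds, are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class TocLinkHeuristic
{
    // linkCountsPerPage[i] is the number of link annotations on page i + 1.
    // Returns the 1-based number of the first page, within the first
    // maxPagesToScan pages, that has at least minLinks links, or null.
    public static int? FindLinkedTocPage(
        IReadOnlyList<int> linkCountsPerPage,
        int maxPagesToScan = 20,
        int minLinks = 5)
    {
        int limit = Math.Min(maxPagesToScan, linkCountsPerPage.Count);
        for (int i = 0; i < limit; i++)
        {
            if (linkCountsPerPage[i] >= minLinks)
            {
                return i + 1;
            }
        }
        return null;
    }

    // Treats the document as having navigable contents if it has any
    // top-level bookmarks or a page that looks like a linked contents page.
    public static bool HasNavigableContents(
        IReadOnlyList<int> linkCountsPerPage,
        int topLevelBookmarkCount)
    {
        return topLevelBookmarkCount > 0
            || FindLinkedTocPage(linkCountsPerPage).HasValue;
    }
}
```

For example, `TocLinkHeuristic.FindLinkedTocPage(new[] { 0, 0, 12, 1 })` returns 3 with the default thresholds, while a document with no bookmarks and no link-heavy page in the scanned range is reported as having no navigable contents. Since neither feature is guaranteed to be present, a negative result only means that nothing was found, not that the book has no chapters.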

Other tips

Since PDF is a binary format, you'll have to use a PDF library such as PDFlib in order to read PDF files.

pdfLib

You may also want to check out this CodeProject article for some examples: Converting PDF to Text in C#

License: CC-BY-SA with attribution
Not affiliated with StackOverflow