Parsing a PDF file with a clickable contents page

Question 1

Your problem is similar to trying to figure out where paragraphs and columns are in PDF files: PDF doesn't typically label a table of contents page as such. So even with a PDF library (such as iTextSharp, as pointed out by mkl), this will not be a trivial task.

With such a library, you will be able to see the pages in the PDF file and the text on the pages. However, if this is a book for example, the table of contents page may be the first, second, third or xth page in the PDF file because of various other pages appearing in front of it (cover, second cover, copyright, tributes, you name it...).

So an algorithm to discover whether there is a table of contents would have to find it somewhere in the first x pages of the PDF file. As there are no standard tags marking the text in the table of contents, this would have to be done by analysing the format of the text on each of those pages.
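One possible form of that text analysis, as a rough sketch: many contents pages consist mostly of lines that end in a page number, often after dot leaders. The Python below assumes you have already extracted the text of each page into a list of strings with whatever PDF library you use. The function name, the regular expression and the thresholds are illustrative assumptions, not anything standard, and they will need tuning for real documents.

```python
import re

# A line such as "Chapter 1 ..... 12" or "Introduction   3": some text,
# then at least two dots/spaces, then a page number (digits or lower-value
# roman numerals) at the very end. This pattern is an assumption.
TOC_LINE = re.compile(r"^\s*\S.*?[.\s]{2,}(\d+|[ivxlc]+)\s*$", re.IGNORECASE)


def find_toc_page_by_text(page_texts, max_pages=15, min_lines=5, min_ratio=0.5):
    """Return the 0-based index of the first page that looks like a table
    of contents, or None. page_texts holds one extracted string per page."""
    for index, text in enumerate(page_texts[:max_pages]):
        # Ignore blank lines so spacing does not skew the ratio.
        lines = [line for line in text.splitlines() if line.strip()]
        if len(lines) < min_lines:
            continue
        hits = sum(1 for line in lines if TOC_LINE.match(line))
        if hits / len(lines) >= min_ratio:
            return index
    return None
```

Only the first max_pages pages are scanned, which matches the idea that the contents page sits somewhere near the front behind the cover and copyright pages.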

There are two things that could be of help (if they are available):

1) In many PDF files the items in a table of contents are, like you say, clickable. So you could look in the PDF file and try to find the first page that contains a lot of hyperlinked items.

2) In many PDF files the table of contents is mirrored in the bookmarks. So you could also examine the bookmark structure and see if you can use that to figure out how many chapters there are in the book.

Keep in mind that both of these features are optional, and even when they are present, how they are used is not standardized.
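Both hints can be turned into simple checks once a PDF library has given you the raw data. Here is a minimal sketch in Python, assuming you have already collected the number of link annotations on each page and the bookmark tree as nested dicts. The data layout, the function names and the threshold are hypothetical, and the library calls that would produce this data are left out because they differ between libraries.

```python
def find_toc_page_by_links(link_counts, max_pages=15, min_links=8):
    """link_counts[i] is the number of link annotations on page i (0-based).
    Return the first page within max_pages with at least min_links links,
    or None if no such page exists."""
    for index, count in enumerate(link_counts[:max_pages]):
        if count >= min_links:
            return index
    return None


def flatten_bookmarks(bookmarks, depth=0):
    """Yield (depth, title) for every entry in a bookmark tree given as a
    list of dicts like {"title": str, "children": [...]} (assumed layout)."""
    for entry in bookmarks:
        yield depth, entry["title"]
        yield from flatten_bookmarks(entry.get("children", []), depth + 1)


def guess_chapter_titles(bookmarks):
    """Treat every top-level bookmark as a chapter. This is only a guess:
    some files nest chapters under parts or add entries such as 'Cover'."""
    return [title for depth, title in flatten_bookmarks(bookmarks) if depth == 0]
```

Since neither feature is guaranteed to exist, both functions should be treated as optional signals: an empty bookmark list or a result of None just means you have to fall back on analysing the page text.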

Question 2

Since PDF is a binary format, you'll have to use a PDF library such as PDFlib in order to read PDF files.

pdfLib

You may also want to check out the CodeProject article "Converting PDF to Text in C#" for some examples.