What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)? [closed]

StackOverflow https://stackoverflow.com/questions/46869

Question

Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to.

Something that works with C# or classic ASP (VBScript) would be ideal. I also need to be able to separate the pages of the PDF.

This question had some interesting suggestions, especially pdftotext, but I'd like to avoid calling an external command-line app if I can.

Solution

You can use the IFilter interface built into Windows to extract text and properties (author, title, etc.) from any supported file type. It's a COM interface, so you would have to use the .NET interop facilities.

You'd also have to download the free PDF IFilter driver from Adobe.

OTHER TIPS

Here is a good list: Open Source Libs for PDF/C#

Most of these are geared toward creating PDFs, but they should have read capability as well.

There is this one as well: iText

I have only played with iText before. Nothing major.
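For what it's worth, here is a minimal sketch of per-page text extraction, assuming the iTextSharp 5.x package (the .NET port of iText) is referenced. The class name `PdfTextDemo` and the helper `ExtractPages` are made up for illustration, and the input path comes from the command line. Treat it as a starting point rather than a tested solution.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;         // PdfReader (assumes iTextSharp 5.x is referenced)
using iTextSharp.text.pdf.parser;  // PdfTextExtractor

static class PdfTextDemo  // hypothetical name, for illustration only
{
    // Returns one string per page, so the pages stay separated.
    static List<string> ExtractPages(string path)
    {
        var pages = new List<string>();
        var reader = new PdfReader(path);
        try
        {
            // iText page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();  // release the underlying file handle
        }
        return pages;
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: PdfTextDemo <file.pdf>");
            return;
        }

        List<string> pages = ExtractPages(args[0]);
        for (int i = 0; i < pages.Count; i++)
        {
            Console.WriteLine("--- page {0} ---", i + 1);
            Console.WriteLine(pages[i]);
        }
    }
}
```

Because the text is collected page by page, this also covers the "separate the pages" part of the question, at least as far as the text goes. Scanned PDFs that contain only images will come back empty, since no OCR is involved.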

We've used Aspose with good results.

The Docotic.Pdf library can be used to extract formatted or plain text from PDF documents.

The library can read PDF documents of any version (up to the latest published standard). Extraction of pages is also supported by the library.

Links to sample code:

Disclaimer: I work for the vendor of the library.

In addition to the accepted answer: there are also commercial alternatives to the Adobe IFilter for text indexing. They provide a similar API and offer additional premium functionality:

  1. Foxit PDF IFilter: provides much faster text indexing compared to Adobe's plugin.
  2. PDFLib PDF iFilter: includes support for damaged PDF documents, plus an additional API to run your own queries.

If you are looking for a single tool that can be used from both managed .NET apps and legacy languages like classic ASP or VB6, the commercial ByteScout PDF Extractor SDK would fit, as it provides both a .NET API and an ActiveX/COM API.

Disclaimer: I work for ByteScout

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow