Question

I want to read the content of PDF files, and I need to do it in C on Linux.

The closest I could get was Haru, but I think Haru can only create PDFs and cannot read them (not 100% sure).

PS: I only need the plain text from the PDF.

Solution

Check out libpoppler. I've never used it for extracting text, just for querying PDF attributes. It's pretty easy to use.

OTHER TIPS

How accurately do you need to parse them? Just extracting strings should be relatively easy; fully accurate rendering is harder. Take a look at the source of evince or ghostscript.
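To give an idea of what "just extracting strings" can mean, here is a rough sketch in plain C that pulls PDF literal strings (the parts written in parentheses, as in (Hello) Tj) out of a buffer. The function name pdf_literal_strings is hypothetical. It assumes the content streams are uncompressed, which is often not true of real PDFs (they are commonly Flate-compressed), and it does not decode octal escapes, hex strings, or font encodings, so it is nowhere near a real parser.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Append one byte to `out`, always leaving room for the terminating NUL. */
static void put_byte(char *out, size_t cap, size_t *pos, char c)
{
    if (*pos + 1 < cap)
        out[(*pos)++] = c;
}

/* Hypothetical helper: copies every literal string "( ... )" found in `data`
 * into `out`, one per line, and returns how many strings were found.
 * Handles nested parentheses and the simple backslash escapes only. */
size_t pdf_literal_strings(const char *data, size_t len, char *out, size_t cap)
{
    size_t pos = 0, count = 0;
    int depth = 0; /* parenthesis nesting level; 0 means outside a string */

    for (size_t i = 0; i < len; i++) {
        char c = data[i];

        if (depth == 0) {
            if (c == '(')
                depth = 1; /* start of a literal string */
            continue;
        }

        if (c == '\\' && i + 1 < len) {
            char e = data[++i];
            switch (e) {
            case 'n': put_byte(out, cap, &pos, '\n'); break;
            case 'r': put_byte(out, cap, &pos, '\r'); break;
            case 't': put_byte(out, cap, &pos, '\t'); break;
            /* \( \) \\ map to the character itself; octal escapes
             * are NOT decoded in this sketch. */
            default:  put_byte(out, cap, &pos, e); break;
            }
        } else if (c == '(') {
            depth++; /* balanced nested parenthesis inside the string */
            put_byte(out, cap, &pos, c);
        } else if (c == ')') {
            if (--depth == 0) {
                put_byte(out, cap, &pos, '\n'); /* end of string */
                count++;
            } else {
                put_byte(out, cap, &pos, c);
            }
        } else {
            put_byte(out, cap, &pos, c);
        }
    }

    if (cap > 0)
        out[pos] = '\0';
    return count;
}
```

In practice you would read the whole file into memory (fopen, fread) and pass the buffer in. Note that this picks up every parenthesized string, including metadata such as the document title, not only the text drawn on pages.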

This is for C++, but it might be a good starting point for understanding PDF structure: http://www.codeproject.com/KB/cpp/ExtractPDFText.aspx (sorry, wrong link before)

Another possibility, though I've never used it, is VersyPDF. It claims to allow you to edit PDFs ... http://versypdf.sybrex-systems-ltd.qarchive.org/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow