Question

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).

So far I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files. Both libraries nonetheless provide exactly the functionality I'm looking for.

My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?


Solution

You might find Docsplit useful:

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)

OTHER TIPS

After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it doesn't really need to be new, because it just wraps the xpdf command-line utilities.
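For reference, shelling out to pdftotext yourself takes only a few lines of Ruby. This is a minimal sketch, not PDF-Toolkit's actual code: it assumes the pdftotext binary (from xpdf or poppler) is on your PATH, and the helper names pdftotext_command and extract_text are made up for illustration.

```ruby
require 'open3'

# Hypothetical helper: builds the argument list for pdftotext.
#   -enc UTF-8  output encoding
#   -f / -l     first and last page to convert (optional)
#   "-"         write the text to standard output instead of a file
def pdftotext_command(pdf_path, first_page: nil, last_page: nil)
  cmd = ['pdftotext', '-enc', 'UTF-8']
  cmd += ['-f', first_page.to_s] if first_page
  cmd += ['-l', last_page.to_s] if last_page
  cmd + [pdf_path, '-']
end

# Hypothetical helper: runs pdftotext and returns the extracted text.
# Passing the command as an array avoids shell quoting problems with
# unusual file names.
def extract_text(pdf_path, **page_range)
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path, **page_range))
  raise "pdftotext failed: #{err.strip}" unless status.success?
  out.force_encoding(Encoding::UTF_8)
end

# Usage (the file name is a placeholder):
#   text = extract_text('report.pdf', first_page: 1, last_page: 10)
```

For large files, the page range lets you process the document in chunks instead of holding all of the text in memory at once.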

You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.

Here are some options:

http://en.wikipedia.org/wiki/List_of_PDF_software

From that link, and from searching SourceForge, there are a couple of command-line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/

Depending on your requirements and what the PDFs look like, you could look at using the Google Docs API (uploading the PDF and then downloading it as text), or could also try something like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to bounce out to the shell to do it, like gocr -i whatever.pdf (I think it works with PDFs).

The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.

If you just need to get the text content out of a PDF file, pdftohtml at SourceForge is efficient. It is not suited for dealing with images.
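Calling it from Ruby could look roughly like this. It's only a sketch under assumptions: pdftohtml must be installed and on your PATH, the flags are the ones I know from its usage output, and the helper names pdftohtml_command and pdf_to_html are hypothetical.

```ruby
require 'open3'

# Hypothetical helper: builds the argument list for pdftohtml.
#   -i         ignore images (we only want the text)
#   -noframes  produce one HTML document instead of a frameset
#   -stdout    write the result to standard output
#   -q         suppress messages
def pdftohtml_command(pdf_path)
  ['pdftohtml', '-i', '-noframes', '-stdout', '-q', pdf_path]
end

# Hypothetical helper: runs pdftohtml and returns the HTML as a string.
def pdf_to_html(pdf_path)
  html, err, status = Open3.capture3(*pdftohtml_command(pdf_path))
  raise "pdftohtml failed: #{err.strip}" unless status.success?
  html
end

# Usage (the file name is a placeholder):
#   html = pdf_to_html('report.pdf')
```

Since images are skipped with -i, the output contains only the text markup, which you can then strip or parse however you like.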

Did you have a look at the CombinePDF library?

It's a pure Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page over another, page numbering, writing basic text and tables, etc.

Here's an example of stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp, and stamps every page of another PDF file with it.

require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"

You can also stamp text, number pages, or write simple tables:

require 'combine_pdf'

pdf = CombinePDF.load "content_file.pdf"

pdf.number_pages # adds page numbers; you can add formatting and placement options

pdf.pages.each {|page| page.textbox "One Way To Stamp"}

# you can use a shortcut method to stamp pages
pdf.stamp_pages "Another way to stamp"

# you can use the shortcut method for both text and PDF stamps
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo

# you can write simple tables
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]

pdf.save "content_with_logo.pdf"

It's not meant for complex operations, but it complements most PDF authoring libraries and allows you to use PDF templates instead of writing the whole thing from scratch.

Licensed under: CC-BY-SA with attribution