Ruby/Rails: Traverse folders and parse metadata to seed DB

https://stackoverflow.com/questions/12397070

01-07-2021

Question

I have a bunch of documents that I'd like to index in a Rails application. I'd like to use a rake task of sorts to comb a directory hierarchy looking for files and capturing the metadata from those files to index in Rails.

I'm not really sure how to do this in Ruby. I have found a utility called pdftk, which can extract the metadata from PDF files (much of what I'm indexing is PDFs), but I'm not sure how to capture the individual pieces of that data.

For example, I'd like to grab the ModDate, or each BookmarkTitle and BookmarkPageNumber, from the output below.

Specifically, I want to traverse a file hierarchy, execute the pdftk $filename dump_data command for each .pdf I find, and then capture the important parts of that output into one or more Rails models.

Output from pdftk:

$ pdftk BoringDocument883c2.pdf dump_data
InfoKey: Creator
InfoValue: Adobe Acrobat 9.3.4
InfoKey: Producer
InfoValue: Adobe Acrobat 9.34 Paper Capture Plug-in
InfoKey: ModDate
InfoValue: D:20110312194536-04'00'
InfoKey: CreationDate
InfoValue: D:20110214174733-05'00'
PdfID0: 2f28dcb8474c6849ae8628bc4157df43
PdfID1: 3e13c82c73a9f44bad90eeed137e7a1a
NumberOfPages: 126
BookmarkTitle: Alternative Maintenance Techniques&#13;
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkTitle: CONTENTS&#13;
BookmarkLevel: 1
BookmarkPageNumber: 4
BookmarkTitle: EXHIBITS&#13;
BookmarkLevel: 1
BookmarkPageNumber: 6
BookmarkTitle: I - INTRODUCTION&#13;
BookmarkLevel: 1
BookmarkPageNumber: 8
BookmarkTitle: II - EXECUTIVE SUMMARY&#13;
BookmarkLevel: 1
BookmarkPageNumber: 13
BookmarkTitle: III - REMOTE DIAGNOSTICS - A STATUS REPORT&#13;
BookmarkLevel: 1
BookmarkPageNumber: 30
BookmarkTitle: IV - ALTERNATIVE TECHNIQUES&#13;
BookmarkLevel: 1
BookmarkPageNumber: 55
BookmarkTitle: V - COMPANYA - A SERVICE PHILOSOPHY&#13;
BookmarkLevel: 1
BookmarkPageNumber: 66
BookmarkTitle: VI - COMPANYB - REDUNDANT HARDWARE ARCHITECTURE&#13;
BookmarkLevel: 1
BookmarkPageNumber: 77
...shortened for brevity...
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelPrefix: F-E12_0001.jpg
PageLabelNumStyle: NoNumber
PageLabelNewIndex: 2
PageLabelStart: 1
PageLabelPrefix: F-E12_0002.jpg
PageLabelNumStyle: NoNumber
PageLabelNewIndex: 3
PageLabelStart: 1
PageLabelPrefix: F-E12_0003.jpg
PageLabelNumStyle: NoNumber
...

Edit: I've recently found the pdf-reader gem, which looks promising and may remove the need to shell out to pdftk at all.

Solution

First off, let me say that my knowledge of Rake isn't that good, so there might be some mistakes. Let me know if something doesn't work and I would be happy to try and fix the problem.

To solve this, I am going to use two rake tasks. One task kicks off the traversal from a default path, and the other walks the directory tree recursively.

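require "pdf-reader" # provided by the pdf-reader gem

desc "Populate the database with PDF metadata from the default PDF path"
task :populate_all_pdf_metadata do
  pdf_path = "/path/to/pdfs"

  Rake::Task[:populate_pdf_metadata].invoke(pdf_path)
end

desc "Recursively traverse a path looking for PDF metadata"
task :populate_pdf_metadata, :pdf_path do |t, args|
  excluded_dir_names = [".", ".."] # Do not look in dirs with these names.

  pdf_path = args[:pdf_path]

  Dir.entries(pdf_path).each do |file|
    next if excluded_dir_names.include?(file)

    # Dir.entries returns bare names, so build the full path before testing it.
    full_path = File.join(pdf_path, file)

    if File.directory?(full_path)
      # Rake runs a task only once unless it is re-enabled, so reset it
      # before invoking it again for the subdirectory.
      Rake::Task[:populate_pdf_metadata].reenable
      Rake::Task[:populate_pdf_metadata].invoke(full_path)
    elsif File.extname(file).downcase == ".pdf"
      reader = PDF::Reader.new(full_path)

      # Populate the database here, e.g. from reader.info and reader.page_count
    end
  end
end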
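
If recursing through Rake feels heavy, the traversal itself can also be sketched in plain Ruby with Dir.glob and called from a single task. This is only a sketch: pdf_paths_under is a hypothetical helper name, not something from Rake or pdf-reader, and it assumes Ruby 2.5 or later for the base: keyword.

```ruby
# Hypothetical helper: returns the path of every PDF under root, at any depth.
def pdf_paths_under(root)
  # "**/*" matches entries at every depth below root; base: keeps any
  # special characters in root itself out of the glob pattern.
  Dir.glob("**/*", base: root)
     .map { |relative| File.join(root, relative) }
     .select { |path| File.file?(path) && File.extname(path).downcase == ".pdf" }
     .sort
end

# Possible use inside a task body (the path is a placeholder):
#   pdf_paths_under("/path/to/pdfs").each do |path|
#     reader = PDF::Reader.new(path)
#     # Populate the database here
#   end
```

The extension check is done in Ruby rather than in the glob pattern so that files named .PDF are picked up as well.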

I believe the code above is similar to what you want to do. To access the database from these tasks, you will need to add the :environment dependency to them (for example, task :populate_all_pdf_metadata => :environment do), which loads your Rails application before the task runs. You can search Google for how to access ActiveRecord models from a rake task. I hope this helps.
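
If you would rather stay with pdftk, the dump_data output from the question can be captured and split into its pieces roughly as follows. This is a sketch under assumptions: parse_pdftk_dump, parse_pdf_date and pdftk_metadata are hypothetical helper names, the parser only understands the "Key: value" lines shown in the question, and the date helper only handles the full D:YYYYMMDDHHMMSS+HH'MM' form.

```ruby
require "open3"

# Hypothetical helper: turns the text of `pdftk FILE dump_data` into a Hash.
def parse_pdftk_dump(text)
  result = { info: {}, bookmarks: [] }
  pending_key = nil # the most recent InfoKey, waiting for its InfoValue

  text.each_line do |line|
    key, value = line.chomp.split(": ", 2)
    next if value.nil? # not a "Key: value" line

    case key
    when "InfoKey"       then pending_key = value
    when "InfoValue"     then result[:info][pending_key] = value if pending_key
    when "NumberOfPages" then result[:pages] = value.to_i
    when "BookmarkTitle"
      # Titles in the sample output end with an encoded carriage return.
      result[:bookmarks] << { title: value.sub(/&#13;\z/, "").strip }
    when "BookmarkLevel"
      result[:bookmarks].last[:level] = value.to_i unless result[:bookmarks].empty?
    when "BookmarkPageNumber"
      result[:bookmarks].last[:page] = value.to_i unless result[:bookmarks].empty?
    end
  end

  result
end

# Hypothetical helper: converts a value such as D:20110312194536-04'00'
# into a Time, or returns nil when the value has a different shape.
def parse_pdf_date(value)
  m = value.match(/\AD:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-]\d{2})'(\d{2})'/)
  return nil unless m

  Time.new(m[1].to_i, m[2].to_i, m[3].to_i, m[4].to_i, m[5].to_i, m[6].to_i,
           "#{m[7]}:#{m[8]}")
end

# Hypothetical wrapper: runs pdftk (which must be on the PATH) for one file.
def pdftk_metadata(path)
  output, status = Open3.capture2("pdftk", path, "dump_data")
  raise "pdftk failed for #{path}" unless status.success?

  parse_pdftk_dump(output)
end
```

Passing the command as separate arguments to Open3.capture2 avoids the shell, so file names containing spaces or quotes need no escaping. The resulting hash (:info, :pages, :bookmarks) can then be mapped onto whatever models you define.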

Licensed under: CC-BY-SA with attribution
