Question

My goal is to have a Python script that will access particular webpages, extract every PDF file on each page whose filename contains a certain word, convert them into HTML/XML, then go through the HTML files to read data from the PDFs' tables.

So far I have imported mechanize (for browsing the pages/finding the PDF files) and I have PDFMiner, but I'm not sure how to use it in a script to get the same functionality it provides on the command line.

What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for Stack Overflow, but I'm having trouble using Google searches and sparse documentation to piece together how to code this. Thanks!


EDIT: So I've decided to go with Scrapy on this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem class to use with my spider, with fields title and url. I have a selector that's grabbing all the links I want, and I want to go through these links and create a PDFItem for each one. Here's the code I have so far:

links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
        item['title'] = link.xpath('/text()')
        item['url'] = URL + link.xpath('@href').extract()[0]

The url line works well, but I don't really know how to do the same for the title. I guess I could just repeat the query at the top with '/text()' appended to the selector, but that seems excessive. Is there a better way to go through each link object in the links list and grab its text and href value?


Solution

I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
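
For illustration, a minimal spider along those lines might look something like the sketch below (the spider name, start URL, and item field are placeholders rather than anything from the question, and it assumes a reasonably recent Scrapy):

import scrapy

class PdfLinkItem(scrapy.Item):
    # Hypothetical item holding just the link URL
    url = scrapy.Field()

class PdfSpider(scrapy.Spider):
    name = 'pdf_links'                            # placeholder name
    start_urls = ['http://example.com/reports']   # placeholder URL

    def parse(self, response):
        # Collect every anchor whose href points at a .pdf file
        for href in response.xpath('//a[contains(@href, ".pdf")]/@href').extract():
            item = PdfLinkItem()
            item['url'] = response.urljoin(href)
            yield item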

After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
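
The conversion step is then just a command-line call, so from Python you can drive it with subprocess. A minimal sketch, assuming pdftohtml from poppler-utils is on your PATH and using placeholder file names:

import subprocess

# -noframes makes pdftohtml write a single HTML file instead of a frameset
subprocess.check_call(['pdftohtml', '-noframes', 'report.pdf', 'report.html'])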

Edit:

links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    # Create a fresh item for every link instead of reusing a single one
    item = PDFItem()
    # text() and @href are evaluated relative to the current link node
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
    yield item

OTHER TIPS

In order to browse and find PDF links on a webpage, a URL library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website; given your description, I find it unnecessary.

PDFMiner's pdf2txt.py converts a PDF to HTML, so you need to invoke this program as a subprocess in your script to create the output HTML files.
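
A minimal sketch of that call, assuming pdf2txt.py is on your PATH and using placeholder file names:

from subprocess import check_call

# -t html selects HTML output, -o names the output file
check_call(['pdf2txt.py', '-t', 'html', '-o', 'document.html', 'document.pdf'])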

So the libraries you would need are an HTTP library, like Requests, and PDFMiner.

The workflow of your script will be something like:

import os
import requests
from subprocess import Popen

...
r = requests.get(<url-which-has-pdf-links>)
# Do a search for pdf links in r.text
...
for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
...
# Build the command line parameters, the way pdf2txt expects
# Invoke the PDFMiner's pdf2txt on the created file as a subprocess
Popen(cmd)

More info on using Requests to save the PDF content to a local file can be found in the Requests documentation, and more on running programs as subprocesses in the subprocess documentation.
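
Putting those pieces together, the download-and-convert loop might look roughly like this (the page URL, the naive regex for finding links, and the file naming are all placeholder assumptions, not tested against any real site):

import re
import requests
from subprocess import check_call

PAGE_URL = 'http://example.com/page-with-pdf-links'  # placeholder

r = requests.get(PAGE_URL)
# Naive search for absolute PDF links in the page source; Scrapy selectors
# or an HTML parser such as lxml would be more robust in practice.
pdf_links = re.findall(r'href="(http[^"]+\.pdf)"', r.text)

for i, pdf_url in enumerate(pdf_links):
    pdf_path = 'download_%d.pdf' % i
    # Stream the PDF content to a local file
    resp = requests.get(pdf_url, stream=True)
    with open(pdf_path, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    # Convert the saved PDF to HTML with PDFMiner's pdf2txt.py
    check_call(['pdf2txt.py', '-t', 'html', '-o', pdf_path.replace('.pdf', '.html'), pdf_path])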

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow