Question

My goal is to have a Python script that will access particular webpages, extract every PDF file on each page whose filename contains a certain word, convert them into HTML/XML, then go through the HTML files to read data from the PDFs' tables.

So far I have imported mechanize (for browsing the pages/finding the PDF files) and I have PDFMiner, but I'm not sure how to use it in a script to get the same functionality it provides on the command line.

What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for Stack Overflow, but I'm having trouble using Google searches and sparse documentation to piece together how to code this. Thanks!


EDIT: So I've decided to go with Scrapy on this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem class to use with my spider, with fields title and url. I have a selector that's grabbing all the links I want, and I want to go through these links and create a PDFItem for each one. Here's the code I have so far:

links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
        item['title'] = link.xpath('/text()')
        item['url'] = URL + link.xpath('@href').extract()[0]

The url line works well, but I don't really know how to do the same for the title. I guess I could just repeat the query at the top with '/text()' appended to the selector, but that seems excessive. Is there a better way to go through each link object in the links list and grab its text and href value?


Solution

I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
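
For illustration, a minimal spider along those lines might look something like the sketch below (the spider name, start URL, and item field are placeholders rather than anything from the question, and it assumes a reasonably recent Scrapy):

import scrapy

class PdfLinkItem(scrapy.Item):
    # Hypothetical item holding just the link URL
    url = scrapy.Field()

class PdfSpider(scrapy.Spider):
    name = 'pdf_links'                            # placeholder name
    start_urls = ['http://example.com/reports']   # placeholder URL

    def parse(self, response):
        # Collect every anchor whose href points at a .pdf file
        for href in response.xpath('//a[contains(@href, ".pdf")]/@href').extract():
            item = PdfLinkItem()
            item['url'] = response.urljoin(href)
            yield item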

After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
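
The conversion step is then just a command-line call, so from Python you can drive it with subprocess. A minimal sketch, assuming pdftohtml from poppler-utils is on your PATH and using placeholder file names:

import subprocess

# -noframes makes pdftohtml write a single HTML file instead of a frameset
subprocess.check_call(['pdftohtml', '-noframes', 'report.pdf', 'report.html'])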

Edit:

links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    # Create a fresh item for every link instead of reusing a single one
    item = PDFItem()
    # text() and @href are evaluated relative to the current link node
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
    yield item

OTHER TIPS

In order to browse and find PDF links on a webpage, a URL library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website; given your description, I find it unnecessary.

PDFMiner's pdf2txt.py converts a PDF to HTML, so you need to invoke this program as a subprocess in your script to create the output HTML files.
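
A minimal sketch of that call, assuming pdf2txt.py is on your PATH and using placeholder file names:

from subprocess import check_call

# -t html selects HTML output, -o names the output file
check_call(['pdf2txt.py', '-t', 'html', '-o', 'document.html', 'document.pdf'])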

So the libraries you would need are an HTTP library, like Requests, and PDFMiner.

The workflow of your script will be something like:

import os
import requests
from subprocess import Popen

...
r = requests.get(<url-which-has-pdf-links>)
# Do a search for pdf links in r.text
...
for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
...
# Build the command line parameters, the way pdf2txt expects
# Invoke the PDFMiner's pdf2txt on the created file as a subprocess
Popen(cmd)

More info on using Requests to save the PDF content to a local file can be found in the Requests documentation, and more on running programs as subprocesses in the subprocess documentation.
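
Putting those pieces together, the download-and-convert loop might look roughly like this (the page URL, the naive regex for finding links, and the file naming are all placeholder assumptions, not tested against any real site):

import re
import requests
from subprocess import check_call

PAGE_URL = 'http://example.com/page-with-pdf-links'  # placeholder

r = requests.get(PAGE_URL)
# Naive search for absolute PDF links in the page source; Scrapy selectors
# or an HTML parser such as lxml would be more robust in practice.
pdf_links = re.findall(r'href="(http[^"]+\.pdf)"', r.text)

for i, pdf_url in enumerate(pdf_links):
    pdf_path = 'download_%d.pdf' % i
    # Stream the PDF content to a local file
    resp = requests.get(pdf_url, stream=True)
    with open(pdf_path, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    # Convert the saved PDF to HTML with PDFMiner's pdf2txt.py
    check_call(['pdf2txt.py', '-t', 'html', '-o', pdf_path.replace('.pdf', '.html'), pdf_path])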

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow