Installing Scraperwiki for Python generates an error pdftohtml not found

Question

The warning only matters if you intend to use scraperwiki.pdftoxml(). It doesn't stop you from installing the scraperwiki package.

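Also, that function doesn't work on Windows at all as is; it uses NamedTemporaryFile objects, which behave differently on Windows than on Linux. On Windows, a file created that way can't be opened a second time by name while it is still open.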
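
To illustrate the difference, here is a minimal sketch of a pattern that works on both platforms: create the temporary file with delete=False, close it before anything else reads it by name, and remove it yourself afterwards. The helper name write_temp_pdf and the sample bytes are made up for this example; they are not part of scraperwiki.

```python
import os
import tempfile


def write_temp_pdf(pdfdata):
    """Write bytes to a temporary .pdf file and return its path.

    Hypothetical helper: the file is closed before returning so that
    another program can open it by name, including on Windows.
    The caller is responsible for deleting it.
    """
    tmp = tempfile.NamedTemporaryFile(suffix='.pdf', delete=False)
    try:
        tmp.write(pdfdata)
    finally:
        tmp.close()
    return tmp.name


# Placeholder bytes, not a real PDF
path = write_temp_pdf(b'%PDF-1.4 example bytes')
try:
    # The file is closed, so reopening it by name is safe on any platform
    with open(path, 'rb') as f:
        data = f.read()
finally:
    os.remove(path)
```

The code further down sidesteps the issue in a simpler way, by writing to fixed file names in the working folder.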

If you do want to use that function, the simplest way to get an up-to-date version of pdftohtml on Windows is to download Calibre Portable. (The version on SourceForge is older.)

Install it anywhere; you just need a few files from it. From the folder containing calibre.exe, copy pdftohtml.exe into your working folder. From the DLLs folder in the Calibre install, also copy freetype.dll, jpeg.dll, libpng12.dll and zlib1.dll into the same working folder.
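
If you'd rather script that copying step, something like the following sketch should do it. The function name copy_pdftohtml_files and both folder arguments are assumptions for illustration; pass whatever paths match your own Calibre install and working folder.

```python
import os
import shutil

# DLLs that pdftohtml.exe needs, as listed above
DLL_NAMES = ['freetype.dll', 'jpeg.dll', 'libpng12.dll', 'zlib1.dll']


def copy_pdftohtml_files(calibre_dir, dest_dir):
    """Copy pdftohtml.exe and its DLLs from a Calibre folder to dest_dir.

    calibre_dir is assumed to be the folder containing calibre.exe,
    with a DLLs subfolder next to it. Returns the copied file paths.
    """
    sources = [os.path.join(calibre_dir, 'pdftohtml.exe')]
    sources += [os.path.join(calibre_dir, 'DLLs', name) for name in DLL_NAMES]

    # Fail early with a clear message rather than copying only some files
    missing = [p for p in sources if not os.path.isfile(p)]
    if missing:
        raise IOError('Missing files: ' + ', '.join(missing))

    copied = []
    for src in sources:
        dest = os.path.join(dest_dir, os.path.basename(src))
        shutil.copy(src, dest)
        copied.append(dest)
    return copied
```

Call it once with the Calibre folder and your working folder before running the conversion code.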

Instead of calling scraperwiki.pdftoxml() directly, you'll also need your own function based on it, like:

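import os

def pdftoxml(pdfdata, options=''):
    """converts pdf file to xml file"""
    # lots of hacky Windows fixes c.f. original
    # write the PDF bytes to a fixed file name in the working folder
    with open('input.pdf', 'wb') as f:
        f.write(pdfdata)
    cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes '
    if options:
        # trailing space keeps the options separate from the file names
        cmd += options + ' '
    cmd += 'input.pdf output.xml'
    # NUL is the Windows null device; this hides pdftohtml's output
    cmd = cmd + " > NUL 2>&1"
    os.system(cmd)
    with open('output.xml', 'r') as f:
        return f.read()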
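
As a possible variation (not what the gist does), the same command can be run through subprocess with an argument list instead of a shell string, which avoids the NUL redirection and quoting issues. This is only a sketch: the names build_pdftohtml_args and run_pdftohtml are invented here, and it assumes pdftohtml.exe is in the working folder or on the PATH and that input.pdf has already been written.

```python
import os
import subprocess


def build_pdftohtml_args(options=None):
    """Return the pdftohtml command as a list of arguments.

    Uses the same flags and fixed file names as the function above;
    options is an optional string of extra flags separated by spaces.
    """
    args = ['pdftohtml', '-xml', '-nodrm', '-zoom', '1.5',
            '-enc', 'UTF-8', '-noframes']
    if options:
        args += options.split()
    args += ['input.pdf', 'output.xml']
    return args


def run_pdftohtml(options=None):
    """Run pdftohtml quietly and return its exit code."""
    # os.devnull is 'nul' on Windows and '/dev/null' elsewhere
    with open(os.devnull, 'wb') as devnull:
        return subprocess.call(build_pdftohtml_args(options),
                               stdout=devnull, stderr=devnull)
```

A non-zero return value from run_pdftohtml() suggests the conversion failed, which os.system() with redirected output makes easy to miss.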

(I was trying to get this working for a user in Windows recently; I'll keep the gist containing this code updated.)