Question

I have been trying to install the scraperwiki module for Python. However, it generates this error:

"UserWarning: Local Scraperlibs requires pdftohtml, but pdftohtml was not found in the PATH. You probably need to install it".

I looked into Poppler, since it includes a pdftohtml file, but I don't know how it works: is there a Python library I need to install, or an .exe file? And how do I go about installing it? I'm running Windows.

Many Thanks

Solution

If you don't intend to use scraperwiki.pdftoxml(), the warning doesn't apply to you. Either way, it doesn't stop you from installing the scraperwiki package.
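
If you want to confirm for yourself whether pdftohtml is visible on your PATH, a minimal sketch using the standard library's shutil.which is below. This is not necessarily how scraperwiki performs its own check, and the helper name find_pdftohtml is made up for illustration.

```python
import shutil


def find_pdftohtml(name="pdftohtml"):
    """Return the full path of the executable if it is on PATH, else None.

    find_pdftohtml is a hypothetical helper, not part of scraperwiki.
    On Windows, shutil.which also tries the extensions listed in PATHEXT,
    so "pdftohtml" can match "pdftohtml.exe".
    """
    return shutil.which(name)
```

Calling find_pdftohtml() returns None when nothing matching is found, which is the situation the warning describes.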

Also, that function doesn't work on Windows at all as is; it uses NamedTemporaryFile objects, which behave differently on Windows than on Linux.
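
To illustrate the difference: per the Python documentation, a file created by NamedTemporaryFile generally can't be opened a second time by name on Windows while it is still open, and that is exactly what an external program such as pdftohtml needs to do. One common workaround, sketched here under that assumption (this is not the scraperwiki code itself, and write_temp_pdf is a hypothetical name), is to create the file with delete=False, close it, and remove it yourself afterwards.

```python
import os
import tempfile


def write_temp_pdf(pdfdata):
    """Write PDF bytes to a closed temporary file and return its path.

    Hypothetical helper: delete=False keeps the file on disk after close(),
    so another process can open it by name. The caller is responsible for
    deleting the file with os.remove() when finished.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdfdata)
    finally:
        tmp.close()
    return tmp.name
```

The returned path can then be handed to an external command, followed by os.remove(path) once the command has finished.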

If you do want to use that function, the simplest way to get an up-to-date version of pdftohtml on Windows is to download Calibre Portable. (The version on Sourceforge is older.)

Install it anywhere; you just need a few files from it. From the folder containing calibre.exe, copy pdftohtml.exe into your working folder. From the DLLs folder in the Calibre install, also copy freetype.dll, jpeg.dll, libpng12.dll and zlib1.dll.
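
If you'd rather script the copying, here is a rough sketch. The file names come from the list above; the assumption that the DLLs folder sits directly inside the folder containing calibre.exe may not match your install, and copy_pdftohtml_files is a hypothetical helper name.

```python
import os
import shutil

# File names as listed in the answer.
EXE_NAME = "pdftohtml.exe"
DLL_NAMES = ["freetype.dll", "jpeg.dll", "libpng12.dll", "zlib1.dll"]


def copy_pdftohtml_files(calibre_dir, dest_dir):
    """Copy pdftohtml.exe and the DLLs it needs into dest_dir.

    calibre_dir is the folder containing calibre.exe. The "DLLs" subfolder
    location is an assumption; adjust it if your install is laid out
    differently. Returns the list of file names that were copied.
    """
    sources = [os.path.join(calibre_dir, EXE_NAME)]
    sources += [os.path.join(calibre_dir, "DLLs", name) for name in DLL_NAMES]
    copied = []
    for src in sources:
        shutil.copy(src, dest_dir)  # raises an error if src is missing
        copied.append(os.path.basename(src))
    return copied
```

Pass it the Calibre folder and your working folder; it stops with an error on the first file it can't find, which tells you which path to adjust.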

You'll also need your own code based on scraperwiki.pdftoxml() to use in place of the original, like:

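import os

def pdftoxml(pdfdata, options):
    """converts pdf file to xml file"""
    # lots of hacky Windows fixes c.f. original
    # write the PDF bytes to a fixed file name in the working folder
    with open('input.pdf', 'wb') as f:
        f.write(pdfdata)
    cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes '
    if options:
        # trailing space keeps the options separate from the file names
        cmd += options + ' '
    cmd += 'input.pdf output.xml'
    # discard pdftohtml's console output (NUL is the Windows null device)
    cmd = cmd + " > NUL 2>&1"
    os.system(cmd)
    with open('output.xml', 'r') as f:
        return f.read()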

(I was trying to get this working for a user in Windows recently; I'll keep the gist containing this code updated.)
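
For what it's worth, the same idea can be written with subprocess instead of os.system, which avoids building a shell command string and the NUL redirection. This is only a sketch for Python 3, not the code from the gist: the function names are made up, the flags are the ones used above, and it assumes, as the code above does, that pdftohtml writes its result to output.xml.

```python
import os
import subprocess
import tempfile


def build_pdftohtml_command(input_path, output_path, options=None,
                            exe="pdftohtml"):
    """Build the pdftohtml argument list using the flags from the answer.

    options is an optional list of extra flag strings; exe can be a full
    path to pdftohtml.exe. Both parameters are illustrative choices.
    """
    cmd = [exe, "-xml", "-nodrm", "-zoom", "1.5", "-enc", "UTF-8",
           "-noframes"]
    if options:
        cmd.extend(options)
    cmd.extend([input_path, output_path])
    return cmd


def pdftoxml_subprocess(pdfdata, options=None, exe="pdftohtml"):
    """Convert PDF bytes to pdftohtml's XML output, returned as text."""
    # Work in a throwaway folder so nothing is left behind afterwards.
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "input.pdf")
        output_path = os.path.join(workdir, "output.xml")
        with open(input_path, "wb") as f:
            f.write(pdfdata)
        # check=True raises CalledProcessError if pdftohtml fails.
        subprocess.run(
            build_pdftohtml_command(input_path, output_path, options, exe),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        # The command asks for UTF-8 output, so read it back as UTF-8.
        with open(output_path, "r", encoding="utf-8") as f:
            return f.read()
```

You would call it with the raw bytes of a PDF, for example the result of open('some.pdf', 'rb').read(), and it raises an exception rather than failing silently if pdftohtml can't be run.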

Licensed under: CC-BY-SA with attribution