If you're not intending to use scraperwiki.pdftoxml(), then the warning doesn't apply. It doesn't stop you from installing the scraperwiki package, however.
Also, that function doesn't work on Windows at all as is; it uses NamedTemporaryFile, which behaves differently on Windows than on Linux.
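To illustrate the difference: with the default settings, a file created by tempfile.NamedTemporaryFile generally can't be opened a second time by name on Windows while it is still open, whereas on Linux it can. That is probably what trips up an external program such as pdftohtml. A minimal sketch of a portable pattern follows; the helper name write_temp_file is made up for this example and is not part of scraperwiki.

```python
import os
import tempfile


def write_temp_file(data, suffix=".pdf"):
    """Write bytes to a temporary file, close it, and return its path.

    delete=False keeps the file on disk after close(), so another
    program can then open it by name on both Windows and Linux.
    The caller is responsible for removing the file afterwards.
    (write_temp_file is a hypothetical helper, not a scraperwiki function.)
    """
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
    finally:
        tmp.close()
    return tmp.name


# Example: write some bytes, read them back by name, then clean up.
path = write_temp_file(b"%PDF-1.4 example bytes")
with open(path, "rb") as f:
    print(len(f.read()))
os.remove(path)
```

The key point is that the file is closed before anything else tries to open it, and deleted explicitly afterwards.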
If you do want to use that function, the simplest way to get an up-to-date version of pdftohtml
on Windows is to download Calibre Portable. (The version on Sourceforge is older.)
Install it anywhere; you just need a few files from it. From the folder containing calibre.exe, copy pdftohtml.exe into your working folder. From the DLLs folder in the Calibre install, also copy freetype.dll, jpeg.dll, libpng12.dll and zlib1.dll into your working folder.
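If you'd rather script the copying, something like the following should do it. This is only a sketch: the function name copy_pdftohtml_files is made up here, and it assumes the layout described above (pdftohtml.exe next to calibre.exe, the DLLs in a DLLs subfolder).

```python
import os
import shutil

# Files named in the answer; the DLLs are assumed to live in Calibre's DLLs folder.
EXE_FILES = ["pdftohtml.exe"]
DLL_FILES = ["freetype.dll", "jpeg.dll", "libpng12.dll", "zlib1.dll"]


def copy_pdftohtml_files(calibre_dir, dest_dir):
    """Copy pdftohtml.exe and its DLLs from a Calibre install into dest_dir.

    calibre_dir is the folder containing calibre.exe.
    Returns the list of expected source paths that were not found.
    (Hypothetical helper, not part of Calibre or scraperwiki.)
    """
    sources = [os.path.join(calibre_dir, name) for name in EXE_FILES]
    sources += [os.path.join(calibre_dir, "DLLs", name) for name in DLL_FILES]
    missing = []
    for src in sources:
        if os.path.isfile(src):
            shutil.copy(src, dest_dir)
        else:
            missing.append(src)
    return missing
```

Call it with the folder you installed Calibre Portable into and your working folder, e.g. copy_pdftohtml_files(calibre_dir, "."), and check the returned list in case the layout differs in your Calibre version.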
You'll also need to use code based on scraperwiki.pdftoxml() instead of the original function, like:
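    import os

    def pdftoxml(pdfdata, options=''):
        """Convert PDF data to XML by running pdftohtml."""
        # lots of hacky Windows fixes c.f. original
        # Write the PDF to a fixed filename in the working folder.
        with open('input.pdf', 'wb') as f:
            f.write(pdfdata)
        cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes '
        if options:
            # Trailing space keeps the options separate from the filenames.
            cmd += options + ' '
        cmd += 'input.pdf output.xml'
        # Discard pdftohtml's console output (NUL is the Windows null device).
        cmd = cmd + " > NUL 2>&1"
        os.system(cmd)
        with open('output.xml', 'r') as f:
            return f.read()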
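As a variation on the function above, here's a sketch that uses subprocess instead of os.system and works in a temporary folder rather than fixed filenames in the working folder. It's untested against Calibre's pdftohtml build; the names build_pdftohtml_command and pdftoxml_subprocess are made up for this example, and it assumes pdftohtml can be found from wherever you run it (for instance, because pdftohtml.exe is in the current folder). Note that it returns the raw bytes of the XML rather than decoded text.

```python
import os
import shlex
import shutil
import subprocess
import tempfile


def build_pdftohtml_command(input_path, output_path, options=""):
    """Build the pdftohtml argument list, using the same flags as the function above."""
    cmd = ["pdftohtml", "-xml", "-nodrm", "-zoom", "1.5",
           "-enc", "UTF-8", "-noframes"]
    if options:
        # Split extra options on whitespace; intended for simple flags only.
        cmd += shlex.split(options)
    cmd += [input_path, output_path]
    return cmd


def pdftoxml_subprocess(pdfdata, options=""):
    """Convert PDF bytes to XML bytes via pdftohtml (hypothetical variant)."""
    workdir = tempfile.mkdtemp()
    try:
        input_path = os.path.join(workdir, "input.pdf")
        output_path = os.path.join(workdir, "output.xml")
        with open(input_path, "wb") as f:
            f.write(pdfdata)
        cmd = build_pdftohtml_command(input_path, output_path, options)
        # Discard pdftohtml's console output, like "> NUL 2>&1" does.
        with open(os.devnull, "wb") as devnull:
            subprocess.call(cmd, stdout=devnull, stderr=devnull)
        with open(output_path, "rb") as f:
            return f.read()
    finally:
        # Remove the temporary folder and everything in it.
        shutil.rmtree(workdir, ignore_errors=True)
```

Passing the command as a list avoids building a shell string by hand, so there's no need to worry about spaces between the options and the filenames.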
(I was trying to get this working for a user on Windows recently; I'll keep the gist containing this code updated.)