Question

I want to extract the content of PDF files available online using PDFMiner.

My code is based on the example in the documentation, which extracts the content of PDF files stored on the hard disk:

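# Import paths as in recent PDFMiner releases; older versions may differ.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
document = PDFDocument(parser)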

That works quite well with some small changes.

Now, I have tried urllib2.urlopen for online PDFs, but that doesn't work. I get the error message: coercing to Unicode: need string or buffer, instance found.

How can I get a string (or whatever) from urllib2.urlopen that is the same as what the open function returns when I give it a PDF file name (rather than a URL)?

Please tell me if my question is not clear.

Solution

I finally found a solution.

I resorted to Request and StringIO and got rid of the open('my_file', 'rb') call:

from urllib2 import Request, urlopen
from StringIO import StringIO

url = 'my_url'

# Download the PDF and wrap its bytes in an in-memory file object.
data = urlopen(Request(url)).read()
memoryFile = StringIO(data)

parser = PDFParser(memoryFile)

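That way, PDFParser receives an in-memory, file-like object holding the downloaded bytes, just as if it had been given a file opened from disk. This matters because the parser needs to seek within its input, which the object returned by urlopen does not support.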
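
For readers on Python 3, where urllib2 and StringIO no longer exist, the same idea can be sketched roughly as follows. This is only a minimal sketch under assumptions: the helper name fetch_as_file and its opener parameter are made up for illustration and are not part of PDFMiner or of the original answer, and io.BytesIO stands in for StringIO because the downloaded data is bytes.

```python
import io
from contextlib import closing
from urllib.request import Request, urlopen


def fetch_as_file(url, opener=urlopen):
    """Download `url` and return its content as a seekable in-memory file.

    `opener` is a hypothetical hook (defaulting to urlopen) so the download
    step can be replaced, for example in tests.
    """
    # closing() guarantees the response is closed once the body is read.
    with closing(opener(Request(url))) as response:
        data = response.read()
    # BytesIO supports read() and seek(), like a file opened with 'rb'.
    return io.BytesIO(data)


# Possible usage, assuming a PDFMiner release that provides these imports:
#   from pdfminer.pdfparser import PDFParser
#   parser = PDFParser(fetch_as_file('my_url'))
```

Reading the whole response into memory keeps the code short, but for very large PDFs it may be preferable to save the download to a temporary file and open that instead.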

Licensed under: CC-BY-SA with attribution