Python Curl writefunction not working on second call

https://stackoverflow.com/questions/16388751

14-04-2022

Question

I've written a simple script in Python.

It extracts the hyperlinks from a web page and then retrieves each of those links to parse some information.

I have similar scripts that reuse the write function without any problems, but for some reason this one fails and I can't figure out why.

General Curl init:

storage = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.USERAGENT, USER_AGENT)
c.setopt(pycurl.COOKIEFILE, "")
c.setopt(pycurl.POST, 0)
c.setopt(pycurl.FOLLOWLOCATION, 1)
#Similar scripts are working this way, why this script not?
c.setopt(c.WRITEFUNCTION, storage.write)

First call to retrieve links:

URL = "http://whatever"
REFERER = URL

c.setopt(pycurl.URL, URL)
c.setopt(pycurl.REFERER, REFERER)
c.perform()

#Write page to file
content = storage.getvalue()
f = open("updates.html", "w")
f.writelines(content)
f.close()
... Here the magic happens and links are extracted ...

Now looping these links:

for i, member in enumerate(urls):
    URL = urls[i]
    print "url:", URL
    c.setopt(pycurl.URL, URL)
    c.perform()

    #Write page to file
    #Still the data from previous!
    content = storage.getvalue()
    f = open("update.html", "w")
    f.writelines(content)
    f.close()
    #print content
    ... Gather some information ...
    ... Close objects etc ...

Solution

If you want to download the URLs to different files in sequence (no concurrent connections):

for i, url in enumerate(urls):
    c.setopt(pycurl.URL, url)
    # Binary mode: the response body is written to the file as raw bytes
    with open("output%d.html" % i, "wb") as f:
        c.setopt(c.WRITEDATA, f) # c.setopt(c.WRITEFUNCTION, f.write) also works
        c.perform()
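
If you need each page in memory rather than in a file, the same idea applies: give every request its own buffer. The following is only a sketch, assuming Python 3 (where pycurl passes bytes to the write callback). The helper name fetch_each is hypothetical, and curl is expected to be a pycurl.Curl handle configured as in the question.

```python
import io


def fetch_each(curl, urls):
    """Return a list of (url, body_bytes) pairs, using a fresh buffer per request.

    `curl` is assumed to be a configured pycurl.Curl handle; `fetch_each`
    is a hypothetical helper name, not part of pycurl.
    """
    results = []
    for url in urls:
        buf = io.BytesIO()  # new, empty buffer for this URL only
        curl.setopt(curl.URL, url)
        curl.setopt(curl.WRITEFUNCTION, buf.write)
        curl.perform()
        results.append((url, buf.getvalue()))
    return results
```

It would be called as fetch_each(c, urls) with the handle c from the question. Because a new buffer is created on every iteration, no body can contain data left over from an earlier request.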

Note:

- storage.getvalue() returns everything that has been written to storage since it was created, so in your case it contains the output from multiple URLs, not just the most recent one.
- open(filename, "w") overwrites the file (the previous content is gone), i.e., update.html contains only whatever is in content on the last iteration of the loop.
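
The accumulation described in the first point can be reproduced without any network access. This sketch assumes Python 3 and uses io.BytesIO in place of StringIO.StringIO; the byte strings stand in for response chunks that pycurl would pass to the write callback. It also shows one way to empty the buffer in place, so that a storage.write callback registered once stays valid.

```python
import io

storage = io.BytesIO()
write = storage.write  # the callback that would be passed to WRITEFUNCTION

# First "response": the buffer holds exactly one page.
write(b"<html>first page</html>")
first = storage.getvalue()

# Second "response" without a reset: it is appended to the first one.
write(b"<html>second page</html>")
combined = storage.getvalue()

# Empty the buffer in place before the next request.
# seek(0) is needed because truncate() does not move the write position.
storage.seek(0)
storage.truncate(0)
write(b"<html>third page</html>")
third = storage.getvalue()

print(first)
print(combined)
print(third)
```

Resetting in place keeps the single c.setopt(c.WRITEFUNCTION, storage.write) call from the question's setup; the alternative is to create a new buffer and register its write method before each c.perform().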

Licensed under: CC-BY-SA with attribution
