Question

I have a Python script that fetches a webpage and mirrors it. It works fine for one specific page, but I can't get it to work for more than one. I assumed I could put multiple URLs into a list and then feed that to the function, but I get this error:

Traceback (most recent call last):
  File "autowget.py", line 46, in <module>
    getUrl()
  File "autowget.py", line 43, in getUrl
    response = urllib.request.urlopen(url)
  File "/usr/lib/python3.2/urllib/request.py", line 139, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 361, in open
    req.timeout = timeout
AttributeError: 'tuple' object has no attribute 'timeout'

Here's the offending code:

url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']
def getUrl(*url):
    response = urllib.request.urlopen(url)
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
getUrl()

I've exhausted Google trying to find how to open a list with urlopen(). I found one way that sort of works. It takes a .txt document and goes through it line-by-line, feeding each line as a URL, but I'm writing this using Python 3 and for whatever reason twillcommandloop won't import. Plus, that method is unwieldy and requires (supposedly) unnecessary work.
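
For reference, plain Python 3 can do that line-by-line reading without twill; a minimal sketch, assuming a hypothetical urls.txt with one URL per line:

import shutil
import urllib.request

# urls.txt is a hypothetical file containing one URL per line
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    with urllib.request.urlopen(url) as response, open('page-%d.html' % i, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)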

Anyway, any help would be greatly appreciated.

Solution

In your code there are some errors:

  • You define getUrl with a variable-length argument list (*url), so inside the function url is a tuple; that is the tuple in your error.
  • You pass that tuple to urlopen() as if it were a single URL string, and since you call getUrl() with no arguments, the tuple is empty and your global list is never used. A quick demonstration follows below.
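
A minimal demonstration of what the *url signature does (an illustration, not the author's code):

def getUrl(*url):
    print(type(url), url)

getUrl()                            # <class 'tuple'> () <- this empty tuple is what reached urlopen()
getUrl('https://www.example.org/')  # <class 'tuple'> ('https://www.example.org/',)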

You can try this code (adapted to Python 3, which is what you are using):

import shutil
import urllib.request

urls = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']
def getUrl(urls):
    for url in urls:
        # Build a file name from the URL string itself
        file_name = url.replace('https://', '').replace('http://', '').replace('.', '_').replace('/', '_')
        with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
getUrl(urls)
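
The string-replace trick above leaves ':' in the name for plain http URLs and can collide on similar URLs; a slightly safer name builder, sketched with urllib.parse (an assumption, not part of the original answer):

import urllib.parse

def name_from_url(url):
    # Keep only host and path, then slug them
    parts = urllib.parse.urlsplit(url)
    return (parts.netloc + parts.path).strip('/').replace('.', '_').replace('/', '_') or 'index'

# e.g. name_from_url('https://www.example.org/') -> 'www_example_org'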

Other tips

urlopen() does not accept a tuple:

urllib.request.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.

And your call is incorrect. With your *url signature it should be:

getUrl(url[0], url[1], url[2])

And inside the function, use a loop like for u in url to traverse all the URLs.
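
Putting those two fixes together, a sketch that keeps the variable-argument signature (Python 3, matching the question; the file-name line is just one possible choice):

import shutil
import urllib.request

url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(*urls):
    for u in urls:
        # Derive a file name from the URL string
        file_name = u.replace('://', '_').replace('.', '_').replace('/', '_')
        with urllib.request.urlopen(u) as response, open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(url[0], url[1], url[2])  # equivalent to getUrl(*url)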

You should just iterate over your URLs using a for loop:

import shutil
import urllib.request


urls = ['https://www.example.org/', 'https://www.foo.com/']

def fetch_urls(urls):
    for i, url in enumerate(urls):
        file_name = "page-%s.html" % i
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

fetch_urls(urls)

I assume you want the content saved to separate files, so I used enumerate here to create a unique file name, but you can obviously use anything from hash() or the uuid module to creating slugs.
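
For example, a uuid-based variant of that file-name line (one option among many, not from the original answer):

import uuid

file_name = "page-%s.html" % uuid.uuid4().hex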

Licensed under: CC-BY-SA with attribution