How can I speed up fetching pages with urllib2 in Python?

https://stackoverflow.com/questions/3490173

28-09-2019

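Question

I have a script that fetches several web pages and parses the information.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

I ran cProfile on it and, as I assumed, urlopen takes up a lot of time. Is there a way to fetch the pages faster? Or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to Python and web development.

Thanks in advance! :)

UPDATE: I have a function called fetchURLs(), which I use to build an array of the URLs I need, so something like urls = fetchURLS(). The URLs are all XML files from the Amazon and eBay APIs (which confuses me as to why it takes so long to load; maybe my web host is slow?).

What I need to do is load each URL, read each page, and send that data to another part of the script, which will parse and display the data.

Note that I can't do the latter part until ALL of the pages have been fetched; that is my issue.

Also, my host limits me to 25 processes at a time, I believe, so whatever is easiest on the server would be nice :)

Here is the timing: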

Sun Aug 15 20:51:22 2010    prof

         211352 function calls (209292 primitive calls) in 22.254 CPU seconds

   Ordered by: internal time
   List reduced from 404 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10   18.056    1.806   18.056    1.806 {_socket.getaddrinfo}
     4991    2.730    0.001    2.730    0.001 {method 'recv' of '_socket.socket' objects}
       10    0.490    0.049    0.490    0.049 {method 'connect' of '_socket.socket' objects}
     2415    0.079    0.000    0.079    0.000 {method 'translate' of 'unicode' objects}
       12    0.061    0.005    0.745    0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
     3428    0.060    0.000    0.202    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
     1698    0.055    0.000    0.068    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
     4125    0.053    0.000    0.056    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
     1698    0.042    0.000    0.358    0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
     1698    0.042    0.000    0.275    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)
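
A listing like the one above can be produced with the standard cProfile and pstats modules. The following is a minimal sketch for Python 3, not the asker's actual script: `slow_lookup` and `profile_top` are hypothetical names, and the sleep merely stands in for the real fetch-and-parse work. Sorting by `tottime` with a limit of 10 corresponds to the "Ordered by: internal time" and "restriction <10>" lines in the listing.

```python
import cProfile
import io
import pstats
import time


def slow_lookup():
    # Hypothetical stand-in for the real fetch-and-parse work.
    time.sleep(0.05)


def profile_top(func, limit=10):
    """Run func under cProfile and return the formatted report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "tottime" is the "internal time" ordering used in the listing.
    stats.sort_stats("tottime").print_stats(limit)
    return out.getvalue()


report = profile_top(slow_lookup)
print(report)
```

In a report of this kind, a single entry such as getaddrinfo dominating the tottime column points at name resolution rather than at the download or the parsing.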

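Solution

EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading vs. async I/O, so I am also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.

This is a duplicate of a question asked 3 days ago:

Python urllib2.urlopen() is slow, need a better way to read several urls (Stack Overflow)

I have polished the code to show how to fetch multiple web pages in parallel using threads.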

import time
import threading
import Queue
import urllib2  # needed by fetch() below

# utility - spawn a thread to execute target for each args
def run_parallel_in_threads(target, args_list):
    result = Queue.Queue()
    # wrapper to collect return value in a Queue
    def task_wrapper(*args):
        result.put(target(*args))
    threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def dummy_task(n):
    for i in xrange(n):
        time.sleep(0.1)
    return n

# below is the application code
urls = [
    ('http://www.google.com/',),
    ('http://www.lycos.com/',),
    ('http://www.bing.com/',),
    ('http://www.altavista.com/',),
    ('http://achewood.com/',),
]

def fetch(url):
    return urllib2.urlopen(url).read()

run_parallel_in_threads(fetch, urls)

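As you can see, the application-specific code is only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.

Unfortunately, most of the other threading code posted here has some flaws. Much of it does active polling to wait for the work to finish; join() is a better way to synchronize. I think this code improves on all the threading examples so far.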
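
The helper returns a Queue, but the post does not show how to read it back. Below is a hedged Python 3 port (`queue` instead of `Queue`) in which each result is tagged with its argument tuple so it can be matched to its URL. `collect` and `fake_fetch` are hypothetical additions; `fake_fetch` stands in for the real download so the sketch runs offline.

```python
import queue
import threading


def run_parallel_in_threads(target, args_list):
    """Run target(*args) in one thread per args tuple; return a Queue of (args, value)."""
    result = queue.Queue()

    def task_wrapper(*args):
        result.put((args, target(*args)))

    threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks until the thread finishes; no polling loop needed
    return result


def collect(result):
    """Drain the Queue into a dict keyed by the argument tuple (hypothetical helper)."""
    collected = {}
    while not result.empty():
        args, value = result.get()
        collected[args] = value
    return collected


def fake_fetch(url):
    # Stand-in for urllib.request.urlopen(url).read(), so the sketch runs offline.
    return "body of " + url


urls = [("http://a.example/",), ("http://b.example/",)]
pages = collect(run_parallel_in_threads(fake_fetch, urls))
print(pages[("http://a.example/",)])
```

Because every thread has been joined before the Queue is returned, draining it with `empty()` is safe here; that would not hold if producers were still running.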

keep-alive connection

WoLpH's suggestion about using a keep-alive connection could be very useful if all your URLs point to the same server.
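
As a rough illustration of the keep-alive idea, the sketch below issues several requests over one connection using Python 3's `http.client` (the successor of `httplib`). It is not WoLpH's code; `fetch_paths` is a hypothetical helper, and it assumes plain HTTP and a server that honours persistent connections.

```python
from http.client import HTTPConnection


def fetch_paths(host, paths, port=80):
    """Fetch several paths from one host over a single persistent connection."""
    conn = HTTPConnection(host, port, timeout=10)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read completely before the next request is sent.
            bodies.append(response.read())
    finally:
        conn.close()
    return bodies
```

With this approach the TCP handshake, and the DNS lookup that precedes it, happen once per host instead of once per URL. For HTTPS the same pattern should apply with `HTTPSConnection`.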

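twisted

Aaron Gallagher is a fan of the twisted framework, and he is hostile to anyone who suggests threads. Unfortunately, many of his claims are misinformation. For example, he said "-1 for suggesting threads. This is IO-bound; threads are useless here." This is contrary to the evidence, as both Nick T and I have demonstrated a speed gain from using threads. In fact, I/O-bound applications have the most to gain from Python's threads (vs. no gain in CPU-bound applications). Aaron's misguided criticism of threads shows he is rather confused about parallel programming in general.

Right tool for the right job

I'm well aware of the issues that pertain to parallel programming using threads, Python, async I/O and so on. Each tool has its pros and cons, and for each situation there is an appropriate tool. I'm not against twisted (though I have not deployed it myself). But I don't believe we can flatly say that threads are BAD and twisted is GOOD in all situations.

For example, if the OP's requirement were to fetch 10,000 websites in parallel, async I/O would be preferable. Threading would not be appropriate (unless maybe with Stackless Python).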
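
To give a feel for the 10,000-site case, here is a sketch using the standard `asyncio` module, which postdates this answer and is used here only as a stand-in for async I/O in general (it is not twisted). `fetch_all` and `fake_fetch` are hypothetical names, the limit of 100 is an assumption, and `fake_fetch` simulates a request so the sketch runs offline; a real version would need a non-blocking HTTP client.

```python
import asyncio


async def fetch_all(urls, fetch_one, max_concurrent=100):
    """Run fetch_one(url) for every URL with at most max_concurrent in flight."""
    limit = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with limit:
            return await fetch_one(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    # Stand-in for a real non-blocking HTTP request.
    await asyncio.sleep(0.001)
    return len(url)


urls = ["http://site%d.example/" % i for i in range(10000)]
sizes = asyncio.run(fetch_all(urls, fake_fetch))
print(len(sizes), "pages fetched")
```

All 10,000 tasks share one thread here, which is the property that makes this style attractive at that scale; for a handful of URLs the threaded helper above is simpler.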

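Aaron's objections to threads are mostly generalizations. He fails to recognize that this is a trivial parallelization task: each task is independent and shares no resources, so most of his criticisms do not apply.

Given that my code has no external dependencies, I'll call it the right tool for the right job.

Performance

I think most people would agree that the performance of this task depends largely on the networking code and the external server, and that the performance of the platform code should have a negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.

In Nick's code there is an obvious flaw that causes the inefficiency. But how do you explain the 233 ms speed gain over my code? I think even twisted fans will refrain from jumping to the conclusion that this is due to the efficiency of twisted. There are, after all, a huge number of variables outside the system code, such as the remote server's performance, the network, caching, and the differences in implementation between urllib2 and the twisted web client.

Just to make sure Python's threading does not incur a large inefficiency, I ran a quick benchmark that spawns 5 threads and then 500 threads. I am quite comfortable saying that the overhead of spawning 5 threads is negligible and cannot explain the 233 ms speed difference.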

In [274]: %time run_parallel_in_threads(dummy_task, [(0,)]*5)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[275]: <Queue.Queue instance at 0x038B2878>

In [276]: %time run_parallel_in_threads(dummy_task, [(0,)]*500)
CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
Wall time: 0.16 s

In [278]: %time run_parallel_in_threads(dummy_task, [(10,)]*500)
CPU times: user 1.13 s, sys: 0.00 s, total: 1.13 s
Wall time: 1.13 s       <<<<<<<< This means 0.13s of overhead
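
The same measurement can be repeated outside IPython. The sketch below is a Python 3 approximation of the benchmark above, not the original code: `spawn_and_join` is a hypothetical helper, and the thread counts and sleep time in the last call are arbitrary choices to keep the run short.

```python
import threading
import time


def spawn_and_join(count, work=lambda: None):
    """Start `count` threads running `work`, join them all, return elapsed seconds."""
    threads = [threading.Thread(target=work) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


for count in (5, 500):
    print("%3d idle threads: %.4f s" % (count, spawn_and_join(count)))

# Same idea as dummy_task(10) above: every thread sleeps for a known time,
# so whatever exceeds that time is thread overhead.
elapsed = spawn_and_join(50, lambda: time.sleep(0.2))
print("50 sleeping threads: %.4f s (%.4f s overhead)" % (elapsed, elapsed - 0.2))
```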

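Further testing of my parallel fetching shows huge variability in the response time over 17 runs. (Unfortunately, I don't have twisted available to verify Aaron's code.)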

0.75 s
0.38 s
0.59 s
0.38 s
0.62 s
1.50 s
0.49 s
0.36 s
0.95 s
0.43 s
0.61 s
0.81 s
0.46 s
1.21 s
2.87 s
1.04 s
1.72 s

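My testing does not support Aaron's conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test for measuring the systematic performance difference between async I/O and threading.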
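
The spread of the 17 timings listed above can be summarized with the standard `statistics` module. This is only arithmetic on the numbers already given; the variable names are arbitrary.

```python
import statistics

# The 17 wall-clock timings (seconds) listed above.
runs = [0.75, 0.38, 0.59, 0.38, 0.62, 1.50, 0.49, 0.36, 0.95,
        0.43, 0.61, 0.81, 0.46, 1.21, 2.87, 1.04, 1.72]

mean = statistics.mean(runs)
median = statistics.median(runs)
spread = statistics.stdev(runs)  # sample standard deviation

print("n=%d min=%.2f median=%.2f mean=%.2f max=%.2f stdev=%.2f"
      % (len(runs), min(runs), median, mean, max(runs), spread))
```

The median is 0.62 s and the mean is about 0.89 s, with runs ranging from 0.36 s to 2.87 s. The standard deviation is larger than the 233 ms gap under discussion, which is consistent with the point that a single comparison of averages says little here.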

Other tips

Use twisted! It makes this kind of thing absurdly easy compared to, say, using threads.

from twisted.internet import defer, reactor
from twisted.web.client import getPage
import time

def processPage(page, url):
    # do something here.
    return url, len(page)

def printResults(result):
    for success, value in result:
        if success:
            print 'Success:', value
        else:
            print 'Failure:', value.getErrorMessage()

def printDelta(_, start):
    delta = time.time() - start
    print 'ran in %0.3fs' % (delta,)
    return delta

urls = [
    'http://www.google.com/',
    'http://www.lycos.com/',
    'http://www.bing.com/',
    'http://www.altavista.com/',
    'http://achewood.com/',
]

def fetchURLs():
    callbacks = []
    for url in urls:
        d = getPage(url)
        d.addCallback(processPage, url)
        callbacks.append(d)

    callbacks = defer.DeferredList(callbacks)
    callbacks.addCallback(printResults)
    return callbacks

@defer.inlineCallbacks
def main():
    times = []
    for x in xrange(5):
        d = fetchURLs()
        d.addCallback(printDelta, time.time())
        times.append((yield d))
    print 'avg time: %0.3fs' % (sum(times) / len(times),)

reactor.callWhenRunning(main)
reactor.run()

This code also performs better than any of the other solutions posted (edited after I closed some things that were using a lot of bandwidth):

Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 29996)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.518s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.461s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30033)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.435s
Success: ('http://www.google.com/', 8117)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.449s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.547s
avg time: 0.482s

And using Nick T's code, rigged up to also give the average of five runs and show the output better:

Starting threaded reads:
...took 1.921520 seconds ([8117, 30070, 15043, 8386, 28611])
Starting threaded reads:
...took 1.779461 seconds ([8135, 15043, 8386, 30349, 28611])
Starting threaded reads:
...took 1.756968 seconds ([8135, 8386, 15043, 30349, 28611])
Starting threaded reads:
...took 1.762956 seconds ([8386, 8135, 15043, 29996, 28611])
Starting threaded reads:
...took 1.654377 seconds ([8117, 30349, 15043, 8386, 28611])
avg time: 1.775s

Starting sequential reads:
...took 1.389803 seconds ([8135, 30147, 28611, 8386, 15043])
Starting sequential reads:
...took 1.457451 seconds ([8135, 30051, 28611, 8386, 15043])
Starting sequential reads:
...took 1.432214 seconds ([8135, 29996, 28611, 8386, 15043])
Starting sequential reads:
...took 1.447866 seconds ([8117, 30028, 28611, 8386, 15043])
Starting sequential reads:
...took 1.468946 seconds ([8153, 30051, 28611, 8386, 15043])
avg time: 1.439s

And using Wai Yip Tung's code:

Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30051 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.704s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.845s
Fetched 8153 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30070 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.689s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.647s
Fetched 8135 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30349 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.693s
avg time: 0.715s

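I've got to say, I do like that the sequential fetches performed BETTER for me.

Here is an example using Python Threads. The other threaded examples here launch a thread per URL, which is not very friendly behaviour if it causes too many hits for the server to handle (for example, it is common for spiders to have many URLs on the same host).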

from threading import Thread
from urllib2 import urlopen
from time import time, sleep

WORKERS=1
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = []

class Worker(Thread):
    def run(self):
        while urls:
            url = urls.pop()
            results.append((url, urlopen(url).read()))

start = time()
threads = [Worker() for i in range(WORKERS)]
any(t.start() for t in threads)

while len(results)<40:
    sleep(0.1)
print time()-start

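Note: The times given here are for 40 URLs and will depend a lot on the speed of your Internet connection and the latency to the server. Being in Australia, my ping is > 300 ms.

With WORKERS=1 it took 86 seconds to run.
With WORKERS=4 it took 23 seconds to run.
With WORKERS=10 it took 10 seconds to run.

So having 10 threads downloading is 8.6 times as fast as a single thread.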
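
The example above limits the total number of workers. If the URL list mixes several hosts, one possible refinement is a limit per host, sketched below for Python 3. This is an extension rather than part of the answer: `MAX_PER_HOST`, `polite_fetch`, `fetch_all`, and `fake_fetch` are hypothetical names, the limit of 2 is an assumption, and `fake_fetch` replaces the real download so the sketch runs offline.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

MAX_PER_HOST = 2  # assumed politeness limit, not a value from the answer

_slots = defaultdict(lambda: threading.BoundedSemaphore(MAX_PER_HOST))
_slots_lock = threading.Lock()


def polite_fetch(url, fetch_one):
    """Call fetch_one(url) while holding one of the slots for the URL's host."""
    with _slots_lock:  # guard creation of a new per-host semaphore
        slot = _slots[urlsplit(url).netloc]
    with slot:
        return fetch_one(url)


def fetch_all(urls, fetch_one, workers=10):
    """Fetch all URLs with `workers` threads, never exceeding MAX_PER_HOST per host."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: polite_fetch(url, fetch_one), urls))


# Offline demo: a stand-in fetch that records the peak number of concurrent calls.
state = {"active": 0, "peak": 0}
state_lock = threading.Lock()


def fake_fetch(url):
    with state_lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)
    with state_lock:
        state["active"] -= 1
    return url


urls = ["http://docs.example/page%d" % i for i in range(20)]
fetched = fetch_all(urls, fake_fetch)
print("peak concurrent requests to one host:", state["peak"])
```

One trade-off of this design is that a worker waiting for a host slot still occupies a pool thread, so it suits lists where the hosts are reasonably mixed.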

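Here is an upgraded version that uses a Queue. There are at least a couple of advantages.
1. The URLs are requested in the order that they appear in the list
2. q.join() can be used to detect when the requests have all completed
3. The results are kept in the same order as the URL list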

from threading import Thread
from urllib2 import urlopen
from time import time, sleep
from Queue import Queue

WORKERS=10
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = [None]*len(urls)

def worker():
    while True:
        i, url = q.get()
        # print "requesting ", i, url       # if you want to see what's going on
        results[i]=urlopen(url).read()
        q.task_done()

start = time()
q = Queue()
for i in range(WORKERS):
    t=Thread(target=worker)
    t.daemon = True
    t.start()

for i,url in enumerate(urls):
    q.put((i,url))
q.join()
print time()-start
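
One caveat with the Queue version: if a fetch raises, task_done() is never called for that item and q.join() waits forever. The sketch below is a Python 3 variant that guards against this. `fetch_in_order` and `fake_fetch` are hypothetical names, and storing the exception in the result slot is just one possible policy.

```python
import queue
import threading


def fetch_in_order(urls, fetch_one, workers=10):
    """Fetch every URL with a fixed pool of threads; results keep the input order."""
    results = [None] * len(urls)
    q = queue.Queue()

    def worker():
        while True:
            i, url = q.get()
            try:
                results[i] = fetch_one(url)
            except Exception as exc:
                results[i] = exc  # keep the failure in the slot instead of losing the worker
            finally:
                q.task_done()  # always called, so q.join() cannot hang on an error

    for _ in range(workers):
        threading.Thread(target=worker, daemon=True).start()
    for item in enumerate(urls):
        q.put(item)
    q.join()
    return results


def fake_fetch(url):
    # Stand-in for urllib.request.urlopen(url).read(); fails for one URL on purpose.
    if "bad" in url:
        raise IOError("cannot fetch " + url)
    return url.upper()


results = fetch_in_order(
    ["http://a.example/", "http://bad.example/", "http://c.example/"], fake_fetch)
print(results)
```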

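The actual wait is probably not in urllib2 but in the server and/or your network connection to the server.

There are two ways of speeding this up.

1. Keep the connection alive (see this question on how to do that: Python urllib2 with keep alive)
2. Use multiple connections; you can use threads or an async approach, as Aaron Gallagher suggested. For that, simply use any of the threading examples and you should do fine :) You can also use the multiprocessing lib to make things pretty easy.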
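
As one way to read the multiprocessing suggestion, the sketch below uses `multiprocessing.pool.ThreadPool`, which offers the Pool API on top of threads and therefore stays within a single process (relevant to the 25-process limit in the question). This is an interpretation, not WoLpH's code; `fetch_many` and `fake_fetch` are hypothetical, and `fake_fetch` replaces the real download so the sketch runs offline.

```python
from multiprocessing.pool import ThreadPool


def fetch_many(urls, fetch_one, workers=10):
    """Map fetch_one over urls using a pool of threads; results keep the input order."""
    with ThreadPool(workers) as pool:
        return pool.map(fetch_one, urls)


def fake_fetch(url):
    # Stand-in for urllib.request.urlopen(url).read(), so the sketch runs offline.
    return len(url)


sizes = fetch_many(["http://a.example/", "http://bb.example/"], fake_fetch)
print(sizes)
```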

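Most of the answers focus on fetching multiple pages from different servers at the same time (threading) rather than on reusing an already open HTTP connection, which matters if the OP is making multiple requests to the same server/site.

In urllib2 a separate connection is created for each request, which hurts performance and results in a slower rate of fetching pages. urllib3 solves this problem by using a connection pool. You can read more here: urllib3 (it is also thread-safe).

There is also Requests, an HTTP library that uses urllib3.

This, combined with threading, should increase the speed of fetching pages.

Nowadays there is an excellent Python lib that does this for you, called requests.

Use the standard API of requests if you want a solution based on threads, or the async API (using gevent under the hood) if you want a solution based on non-blocking IO.

Since this question was posted, it looks like a higher-level abstraction has become available, ThreadPoolExecutor: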

https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example

For convenience, the example from there is pasted here:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

There is also map, which I think makes the code easier: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.executor.map
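
A sketch of what the map variant might look like is given below. `load_all` and `fake_load_url` are hypothetical names; `fake_load_url` has the same signature as load_url in the example above but returns dummy bytes so the sketch runs offline.

```python
import concurrent.futures


def load_all(urls, load_url, timeout=60, max_workers=5):
    """Return (url, data) pairs in input order, using Executor.map."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in the order of `urls`, not in completion order.
        # An exception raised by load_url is re-raised when its result is reached.
        datas = executor.map(lambda url: load_url(url, timeout), urls)
        return list(zip(urls, datas))


def fake_load_url(url, timeout):
    # Stand-in with the same signature as load_url, so the sketch runs offline.
    return b"x" * len(url)


pairs = load_all(["http://a.example/", "http://bb.example/"], fake_load_url)
for url, data in pairs:
    print("%r page is %d bytes" % (url, len(data)))
```

Compared with as_completed, map is shorter but gives up per-URL error handling: the first failing URL stops the iteration unless load_url catches its own exceptions.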

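Ray offers an elegant way to do this (in both Python 2 and Python 3). Ray is a library for writing parallel and distributed Python.

Simply define the fetch function with the @ray.remote decorator. Then you can fetch a URL in the background by calling fetch.remote(url).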

import ray
import sys

ray.init()

@ray.remote
def fetch(url):
    if sys.version_info >= (3, 0):
        import urllib.request
        return urllib.request.urlopen(url).read()
    else:
        import urllib2
        return urllib2.urlopen(url).read()

urls = ['https://en.wikipedia.org/wiki/Donald_Trump',
        'https://en.wikipedia.org/wiki/Barack_Obama',
        'https://en.wikipedia.org/wiki/George_W._Bush',
        'https://en.wikipedia.org/wiki/Bill_Clinton',
        'https://en.wikipedia.org/wiki/George_H._W._Bush']

# Fetch the webpages in parallel.
results = ray.get([fetch.remote(url) for url in urls])

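If you also want to process the web pages in parallel, you can either put the processing code directly into fetch, or you can define a new remote function and compose them together.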

@ray.remote
def process(html):
    tokens = html.split()
    return set(tokens)

# Fetch and process the pages in parallel.
results = []
for url in urls:
    results.append(process.remote(fetch.remote(url)))
results = ray.get(results)

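If you have a very long list of URLs to fetch, you may wish to issue some tasks and then process them in the order that they complete. You can do this using ray.wait.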

urls = 100 * urls  # Pretend we have a long list of URLs.
results = []

in_progress_ids = []

# Start pulling 10 URLs in parallel.
for _ in range(10):
    url = urls.pop()
    in_progress_ids.append(fetch.remote(url))

# Whenever one finishes, start fetching a new one.
while len(in_progress_ids) > 0:
    # Get a result that has finished.
    [ready_id], in_progress_ids = ray.wait(in_progress_ids)
    results.append(ray.get(ready_id))
    # Start a new task.
    if len(urls) > 0:
        in_progress_ids.append(fetch.remote(urls.pop()))

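See the Ray documentation.

Fetching web pages will obviously take a while, as you're not accessing anything local. If you have several to access, you could use the threading module to run a couple at once.

Here's a very crude example: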

import threading
import urllib2
import time

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']
data1 = []
data2 = []

class PageFetch(threading.Thread):
    def __init__(self, url, datadump):
        self.url = url
        self.datadump = datadump
        threading.Thread.__init__(self)
    def run(self):
        page = urllib2.urlopen(self.url)
        self.datadump.append(page.read()) # don't do it like this.

print "Starting threaded reads:"
start = time.clock()
for url in urls:
    PageFetch(url, data2).start()
while len(data2) < len(urls): pass # don't do this either.
print "...took %f seconds" % (time.clock() - start)

print "Starting sequential reads:"
start = time.clock()
for url in urls:
    page = urllib2.urlopen(url)
    data1.append(page.read())
print "...took %f seconds" % (time.clock() - start)

for i,x in enumerate(data1):
    print len(data1[i]), len(data2[i])

This was the output when I ran it:

Starting threaded reads:
...took 2.035579 seconds
Starting sequential reads:
...took 4.307102 seconds
73127 19923
19923 59366
361483 73127
59366 361483

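Grabbing the data from the threads by appending to a list is probably ill-advised (a Queue would be better), but it illustrates that there is a difference.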
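
In the output above the two size columns do not line up because the threads append in completion order. The sketch below shows one way to address both "don't do this" comments in Python 3: each thread writes to its own slot, and join() replaces the busy-wait loop. It is an illustrative variant, not Nick T's code; `threaded_reads`, `DELAYS`, and `fake_fetch` are hypothetical, and `fake_fetch` simulates downloads of different durations so the sketch runs offline.

```python
import threading
import time


class PageFetch(threading.Thread):
    """Fetch one URL and store the body in this thread's own slot."""

    def __init__(self, index, url, slots, fetch_one):
        threading.Thread.__init__(self)
        self.index = index
        self.url = url
        self.slots = slots
        self.fetch_one = fetch_one

    def run(self):
        self.slots[self.index] = self.fetch_one(self.url)


def threaded_reads(urls, fetch_one):
    slots = [None] * len(urls)
    threads = [PageFetch(i, url, slots, fetch_one) for i, url in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks without burning CPU, unlike the `while ...: pass` loop
    return slots


DELAYS = {"http://slow.example/": 0.05, "http://fast.example/": 0.0}


def fake_fetch(url):
    # Stand-in for urllib.request.urlopen(url).read(); the first URL finishes last.
    time.sleep(DELAYS[url])
    return "body of " + url


urls = list(DELAYS)
data1 = [fake_fetch(url) for url in urls]   # sequential reads
data2 = threaded_reads(urls, fake_fetch)    # threaded reads
for a, b in zip(data1, data2):
    print(len(a), len(b))
```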

Here's a standard library solution. It's not quite as fast, but it uses less memory than the threaded solutions.

try:
    from http.client import HTTPConnection, HTTPSConnection
except ImportError:
    from httplib import HTTPConnection, HTTPSConnection
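# `urls` is a list of URL strings, as in the examples above.
connections = []
results = []

# Send every request first, one connection per URL...
for url in urls:
    # The URL needs at least a trailing slash after the host:
    # 'http://host/path' -> 'http:', '', 'host', 'path'
    scheme, _, host, path = url.split('/', 3)
    h = (HTTPConnection if scheme == 'http:' else HTTPSConnection)(host)
    h.request('GET', '/' + path)
    connections.append(h)
# ...then read the responses in order.
for h in connections:
    results.append(h.getresponse().read())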

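Also, if most of your requests are to the same host, then reusing the same HTTP connection would probably help more than doing things in parallel.

Here is a Python network benchmark script for identifying slowness in a single connection: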

"""Python network test."""
from socket import create_connection
from time import time

try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen

TIC = time()
create_connection(('216.58.194.174', 80))
print('Duration socket IP connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
create_connection(('google.com', 80))
print('Duration socket DNS connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
urlopen('http://216.58.194.174')
print('Duration urlopen IP connection (s): {:.2f}'.format(time() - TIC))

TIC = time()
urlopen('http://google.com')
print('Duration urlopen DNS connection (s): {:.2f}'.format(time() - TIC))

And an example of the results with Python 3.6:

Duration socket IP connection (s): 0.02
Duration socket DNS connection (s): 75.51
Duration urlopen IP connection (s): 75.88
Duration urlopen DNS connection (s): 151.42

Python 2.7.13 gives very similar results.

In this case, the DNS and urlopen slowness are easily identified.
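
To isolate the resolver from the connection itself, the lookup can also be timed directly. The sketch below calls `socket.getaddrinfo`, the same function that accounts for 18.056 s of the 22.254 s in the question's profile. `time_lookup` is a hypothetical helper, and 'localhost' is used only so the sketch runs without network access; substitute the host under investigation.

```python
import socket
import time


def time_lookup(host, port=80):
    """Return (seconds, addresses) for one socket.getaddrinfo() call."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    elapsed = time.perf_counter() - start
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses


elapsed, addresses = time_lookup("localhost")
print("getaddrinfo('localhost') took %.4f s -> %s" % (elapsed, addresses))
```

If this call alone takes seconds for a real host name, the bottleneck is the DNS configuration of the machine, and neither threads nor async I/O will remove it, although they can overlap the waits.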

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow