Retrieve links from a web page using Python and BeautifulSoup

https://stackoverflow.com/questions/1080411

22-08-2019

Question

How can I retrieve the links on a web page and copy their URL addresses using Python?

Solution

Here is a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer  # parse_only and has_attr are BeautifulSoup 4 APIs

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

The BeautifulSoup documentation is actually quite good and covers a number of typical scenarios:

http://www.crummy.com/software/BeautifulSoup/documentation.html

Edit: Note that I used the SoupStrainer class because it is a bit more efficient (in both memory and speed) if you know in advance what you want to parse.

Other tips

For completeness' sake, here is the BeautifulSoup 4 version, which also makes use of the encoding supplied by the server:

from bs4 import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

Or the Python 3 version:

from bs4 import BeautifulSoup
import urllib.request

resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

And a version using the requests library, which as written works in both Python 2 and 3:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

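The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.

BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong and can conflict with <meta> header information found in the HTML itself. That is why the code above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* MIME type, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.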
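The precedence described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not what BeautifulSoup or requests do internally: choose_encoding and both regular expressions are hypothetical helpers introduced here, and the <meta> pattern is a much cruder detector than EncodingDetector.find_declared_encoding().

```python
import re

# Hypothetical, simplified patterns (assumptions made for this sketch only).
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)
HEADER_CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.IGNORECASE)


def choose_encoding(content_type, html_bytes):
    """Return the encoding hint to pass on, or None to let the parser guess."""
    # 1. A charset declared inside the HTML wins over the server's header.
    meta = META_CHARSET.search(html_bytes[:4096])
    if meta:
        return meta.group(1).decode('ascii').lower()
    # 2. Otherwise trust the header only if it names a charset explicitly.
    header = HEADER_CHARSET.search(content_type or '')
    if header:
        return header.group(1).lower()
    # 3. No hint at all: do not fall back to a default such as Latin-1.
    return None


print(choose_encoding('text/html; charset=ISO-8859-1', b'<meta charset="utf-8">'))
print(choose_encoding('text/html; charset=ISO-8859-1', b'<p>hi</p>'))
print(choose_encoding('text/html', b'<p>hi</p>'))
```

Returning None in the last case mirrors the advice to ignore response.encoding when the Content-Type header carries no charset, so that the HTML parser can do its own detection.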

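Others have recommended BeautifulSoup, but it is much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It is much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too, if you don't want to learn the lxml API.

Ian Bicking agrees.

There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or something else where anything that is not pure Python is not allowed.

lxml.html also supports CSS3 selectors, so this sort of thing is trivial.

An example with lxml and XPath would look like this: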

import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')

dom =  lxml.html.fromstring(connection.read())

for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
    print link

import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a', href=True):  # skip anchors without an href, which would raise KeyError below
  if 'national-park' in a['href']:
    print 'found a url with national-park in the link'

The following code retrieves all the links available on a web page using urllib2 and BeautifulSoup 4:

import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)

for line in soup.find_all('a'):
    print(line.get('href'))

Under the hood, BeautifulSoup 4 can now use lxml as its parser. Requests, lxml, and list comprehensions make a killer combo.

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

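In the list comprehension, the "if '//' in x and 'url.com' not in x" condition is a simple way to scrub the URL list of the site's "internal" navigation URLs and the like.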
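To see what that filter keeps, here is a small offline sketch in Python 3. The hrefs list is made up for illustration, and is_external is a hypothetical helper (not part of lxml or requests) that compares host names with urllib.parse instead of searching for a substring.

```python
from urllib.parse import urlparse

# Made-up hrefs standing in for the result of dom.xpath('//a/@href').
hrefs = [
    '/section/world',
    '#top',
    'https://www.nytimes.com/section/arts',
    'https://example.org/story',
    '//cdn.example.net/lib.js',
]

# The substring filter from the answer above.
by_substring = [x for x in hrefs if '//' in x and 'nytimes.com' not in x]


def is_external(href, site='nytimes.com'):
    """True if href names a host other than `site` or one of its subdomains."""
    host = urlparse(href).hostname  # None for relative links and fragments
    return host is not None and host != site and not host.endswith('.' + site)


by_hostname = [x for x in hrefs if is_external(x)]

print(by_substring)
print(by_hostname)
```

Both filters agree on this sample. They differ on a link such as https://example.org/?ref=nytimes.com, which the substring test drops because the site name appears in the query string, while the host name comparison keeps it.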

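To find all the links, in this example we will use the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.

import urllib2
import re

# connect to a URL (`url` must be set to the page address beforehand)
website = urllib2.urlopen(url)

# read html code
html = website.read()

# use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links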
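The difference between re.search() and re.findall() can be tried offline on a small inline string (Python 3, with made-up markup). One detail worth knowing: when a pattern contains more than one capturing group, as the one in the snippet above does, re.findall() returns a list of tuples rather than a list of strings.

```python
import re

# Inline sample markup, so no network access is needed.
html = '<a href="http://a.example/x">A</a> <a href="https://b.example/y">B</a>'

# One capturing group; the scheme alternation is non-capturing (?:...).
pattern = r'"((?:http|ftp)s?://.*?)"'

first = re.search(pattern, html).group(1)  # first match only
every = re.findall(pattern, html)          # list of strings, one per match

# The pattern from the snippet above has two capturing groups, so findall
# yields (url, scheme) tuples instead of plain strings.
pairs = re.findall(r'"((http|ftp)s?://.*?)"', html)
urls = [url for url, _scheme in pairs]

print(first)
print(every)
print(pairs)
```

Making the inner group non-capturing is one way to get plain strings back; unpacking the tuples, as in the last list comprehension, is another.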

For just getting the links, without B.soup or regular expressions:

import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item=item[ind+len(tag):]
            end=item.index(endtag)
        except: pass
        else:
            print item[:end]

For more complex operations, of course, BSoup is still preferred.
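Another dependency-free route, swapped in here for the string splitting above, is the html.parser module from the Python 3 standard library. The sketch below is an assumption-laden illustration: LinkCollector is a hypothetical class name and the sample markup is invented, but it shows how a real tokenizer copes with attribute order, quoting style, upper-case tags, and entities.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        href = dict(attrs).get('href')
        if href:
            self.links.append(href)


sample = '''
<p><a class="nav" href="/docs">Docs</a> <a name="anchor">no href</a>
<A HREF='http://example.com/a?x=1&amp;y=2'>Example</A></p>
'''

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

The string-splitting version would miss the first link (href is not the first attribute) and the third (single quotes, upper-case tag), which is the kind of variation a parser handles for you.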

Why not use regular expressions:

import urllib2
import re

url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))

This script does what you are looking for, but also resolves the relative links to absolute links.

import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    # Pass the list itself, not a generator: guess_root() and resolve_links()
    # both iterate over it, and a generator would be partly consumed by the first.
    return resolve_links(get_dom(url).xpath('//a/@href'))

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link

for link in get_links('http://www.google.com'):
    print link

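Links can be within a variety of attributes, so you could pass a list of those attributes to select.

For example, with the src and href attributes (here I am using the starts-with ^ operator to specify that either of these attribute values starts with http). You can tailor this as required.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)

Attribute = value selectors

[attr^=value]

Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.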

Here is an example that uses @ars's accepted answer together with the BeautifulSoup4, requests, and wget modules to handle the downloads.

import requests
import wget
import os

from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)

I found that the answer by @Blairg23 works after the following correction (covering the scenario where it failed to work correctly):

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)

For Python 3:

urllib.parse.urljoin has to be used instead in order to obtain the full URL.
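A quick offline look at what urllib.parse.urljoin does with different kinds of href may make the correction clearer. The base URL and file names below are placeholders chosen for illustration, not taken from the answers above.

```python
from urllib.parse import urljoin

# Placeholder base URL; the trailing slash marks it as a directory.
base = 'https://example.com/data/eeg_full/'

hrefs = [
    'sample.tar.gz',                      # relative to the current directory
    './readme.txt',                       # explicit current directory
    '../index.html',                      # parent directory
    '/robots.txt',                        # relative to the site root
    'http://other.example/file.tar.gz',   # already absolute
]

resolved = {href: urljoin(base, href) for href in hrefs}
for href, full in resolved.items():
    print(href, '->', full)

# Plain concatenation is only right for the first case.
concatenated = base + '/robots.txt'
print(concatenated)
```

Concatenation yields https://example.com/data/eeg_full//robots.txt for the root-relative link, whereas urljoin resolves it against the site root.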

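BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).

import lxml.html

doc = lxml.html.parse(url)

links = doc.xpath('//a[@href]')

for link in links:
    print link.attrib['href']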

The code above will return the links as is, and in most cases they will be relative links or absolute links from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in the relative paths, though, but so far I have not needed that. If you need to parse URL fragments containing ../ or ./, then urlparse.urljoin might come in handy.

NOTE: Direct lxml URL parsing doesn't handle loading from https and doesn't do redirects, so for this reason the version below uses urllib2 + lxml.

#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None


def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)


if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')

    for link in links:
        href = link.attrib['href']

        if fnmatch.fnmatch(href, glob_patt):

            # Note the comma after 'https://': without it the two strings are concatenated.
            if not href.startswith(('http://', 'https://', 'ftp://')):

                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)

            if urltools:
                href = urltools.normalize(href)

            print href

The usage is as follows:

getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
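The glob matching behind those invocations can be tried without fetching anything. This Python 3 sketch uses made-up hrefs; pick_pattern and select are hypothetical helpers, and fnmatch.fnmatchcase is swapped in for fnmatch.fnmatch so that the result does not depend on the platform's case-folding rules.

```python
import fnmatch

# Made-up hrefs, standing in for the values read from link.attrib['href'].
hrefs = ['/music/a.mp3', '/users/42', 'http://x.example/b.mp3', '/about.html']


def pick_pattern(argv):
    """Second command-line argument if present and non-empty, else match everything."""
    return argv[2] if len(argv) > 2 and argv[2] else '*'


def select(candidates, pattern):
    # '*' also matches '/', so a pattern applies to the whole href, not one path segment.
    return [h for h in candidates if fnmatch.fnmatchcase(h, pattern)]


with_glob = select(hrefs, pick_pattern(['getlinks.py', 'http://fakedomain.mu/somepage.html', '*.mp3']))
without_glob = select(hrefs, pick_pattern(['getlinks.py', 'http://fakedomain.mu/somepage.html']))
users_only = select(hrefs, '*users*')

print(with_glob)
print(without_glob)
print(users_only)
```

The conditional expression in pick_pattern is meant to behave like the `len(sys.argv) > 2 and sys.argv[2] or '*'` idiom in the script, falling back to '*' when the argument is missing or empty.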

import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow