Retrieve links from a web page using Python and BeautifulSoup

https://stackoverflow.com/questions/1080411

22-08-2019

Question

How can I retrieve the links on a web page with Python and copy the URL address of each link?

Solution

Here is a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer  # parse_only and has_attr are BeautifulSoup 4 APIs

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

The BeautifulSoup documentation is actually quite good and covers a number of typical scenarios:

http://www.crummy.com/software/beautifulsoup/documentation.html

Edit: I used the SoupStrainer class because it is a bit more efficient (in both memory and speed) when you know in advance what you are parsing.
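To make the effect of SoupStrainer concrete, here is a minimal sketch that parses an inline HTML string instead of fetching a page. The sample markup and the helper name `extract_hrefs` are assumptions made for illustration; only `BeautifulSoup`, `SoupStrainer`, `parse_only`, and `has_attr` come from the answer above.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Inline sample markup (an assumption for illustration; no network access needed).
SAMPLE_HTML = """
<html><body>
  <p>Intro text that the strainer discards.</p>
  <a href="/first">First</a>
  <a name="anchor-only">No href here</a>
  <a href="https://example.com/second">Second</a>
</body></html>
"""


def extract_hrefs(html):
    """Return href values while building tree nodes only for <a> tags."""
    only_links = SoupStrainer("a")
    soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
    # Anchors without an href attribute are filtered out here.
    return [tag["href"] for tag in soup.find_all("a") if tag.has_attr("href")]


hrefs = extract_hrefs(SAMPLE_HTML)
print(hrefs)
```

Because the strainer keeps only `<a>` elements, the `<p>` element never becomes part of the tree, which is where the memory saving comes from.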

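Other tips

For completeness' sake, here is the BeautifulSoup 4 version, which also uses the encoding supplied by the server (Python 2, since it relies on urllib2):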

from bs4 import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

Or the Python 3 version:

from bs4 import BeautifulSoup
import urllib.request

resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

And a version using the requests library, which as written works in both Python 2 and 3:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

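The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.

Development of BeautifulSoup 3 stopped in March 2012; new projects really should always use BeautifulSoup 4.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can tell BeautifulSoup which character set was found in the HTTP response headers to assist in decoding, but that value can be wrong and can conflict with the <meta> header information found in the HTML itself. That is why the code above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding(): it makes sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.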
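The precedence described above (a declaration embedded in the HTML wins, the HTTP header is the fallback, and nothing is forced when neither is present) can be sketched without any network access. The helper names `http_charset` and `choose_encoding` and the sample inputs are assumptions for illustration; `EncodingDetector.find_declared_encoding` is the bs4 call already used in the answer, and `email.message.Message.get_param` is the standard-library method behind `resp.info().get_param('charset')`.

```python
from email.message import Message

from bs4.dammit import EncodingDetector


def http_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def choose_encoding(content_type, body):
    """Prefer the encoding declared inside the HTML over the HTTP header."""
    html_encoding = EncodingDetector.find_declared_encoding(body, is_html=True)
    return html_encoding or http_charset(content_type)


# The server claims ISO-8859-1, but the document itself declares UTF-8.
conflicting = choose_encoding(
    "text/html; charset=ISO-8859-1",
    b'<html><head><meta charset="utf-8"></head><body></body></html>',
)
# No declaration in the document: fall back to the header.
header_only = choose_encoding("text/html; charset=windows-1252", b"<p>no declaration</p>")
# Neither source names a charset: return None and let BeautifulSoup detect it.
unknown = choose_encoding("text/html", b"<p>no declaration</p>")

print(conflicting, header_only, unknown)
```

The resulting value is what you would pass as `from_encoding=`; passing `None` simply leaves detection to BeautifulSoup.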

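Others have recommended BeautifulSoup, but it is much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It is much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (which is BeautifulSoup's claim to fame). It also has a compatibility API for BeautifulSoup if you do not want to learn the lxml API.

Ian Bicking agrees.

There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or somewhere else where anything that is not pure Python is not allowed.

lxml.html also supports CSS3 selectors, so this sort of thing is trivial.

An example with lxml and XPath looks like this: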

import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')

dom =  lxml.html.fromstring(connection.read())

for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
    print link

import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
# href=True skips anchors without an href, which would otherwise raise KeyError below
for a in soup.findAll('a', href=True):
  if 'national-park' in a['href']:
    print 'found a url with national-park in the link'

The following code retrieves all the links available on a web page using urllib2 and BeautifulSoup4:

import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url, 'html.parser')  # `url` holds the page HTML; naming the parser avoids bs4's parser-guess warning

for line in soup.find_all('a'):
    print(line.get('href'))

BeautifulSoup can now use lxml under the hood as its parser. Requests, lxml, and list comprehensions make a killer combo.

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

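In the list comprehension, the "if '//' in x and 'url.com' not in x" condition is a simple way to scrub the URL list of the site's "internal" navigation URLs and the like.

To find all the links, this example uses the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.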
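The substring test is quick but coarse: it also drops any external URL that merely mentions the site name somewhere in its path or query string. As a hedged alternative, the sketch below compares hostnames with the standard-library `urllib.parse`. The function name `external_links` and the sample list are assumptions for illustration, not part of the answer above.

```python
from urllib.parse import urlparse


def external_links(hrefs, site_domain):
    """Keep absolute or protocol-relative URLs whose host lies outside site_domain."""
    result = []
    for href in hrefs:
        host = urlparse(href).hostname  # None for relative paths such as "/about"
        if host is None:
            continue
        if host == site_domain or host.endswith("." + site_domain):
            continue  # the site itself or one of its subdomains
        result.append(href)
    return result


sample = [
    "/section/world",
    "https://www.nytimes.com/section/us",
    "//static.nytimes.com/app.js",
    "https://example.org/?ref=nytimes.com",
    "http://example.com/page",
]
external = external_links(sample, "nytimes.com")
print(external)
```

Here the `example.org` link survives even though its query string contains the site name, which the plain `not in` test would have filtered out.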

import urllib2

import re
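# placeholder URL (an assumption: the original snippet left `url` undefined)
url = "http://www.somewhere.com"
#connect to a URL
website = urllib2.urlopen(url)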

#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links
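One detail worth knowing about re.findall(): when the pattern contains more than one capture group, each match is returned as a tuple of groups rather than as a single string. The sketch below shows this on an inline string (the sample HTML is an assumption for illustration) and uses a non-capturing group to get plain strings back.

```python
import re

# Inline sample instead of a downloaded page (an assumption for illustration).
html = '<a href="http://example.com/a">A</a> <a href="https://example.org/b">B</a>'

# Two capture groups, as in the snippet above: findall returns one tuple per match.
with_groups = re.findall(r'"((http|ftp)s?://.*?)"', html)

# Making the inner group non-capturing yields plain URL strings.
plain = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

# re.search stops at the first match only.
first = re.search(r'"((?:http|ftp)s?://.*?)"', html)

print(with_groups)
print(plain)
print(first.group(1))
```

With the two-group pattern you would read the URL from index 0 of each tuple.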

Just for getting the links, without BeautifulSoup or regex:

import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item=item[ind+len(tag):]
            end=item.index(endtag)
        except: pass
        else:
            print item[:end]

For more complex operations, of course, BeautifulSoup is still preferred.
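If the goal is to avoid third-party packages, a middle ground between string splitting and BeautifulSoup is the standard-library `html.parser` module, which copes with single-quoted attributes, extra attributes, and upper-case tags. This is a minimal Python 3 sketch; the class name `LinkCollector`, the helper `collect_links`, and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = """<p><a class="x" href='/one'>one</a>
<A HREF="two.html" target=_blank>two</A> <a name="top">no href</a></p>"""

links = collect_links(sample)
print(links)
```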

Why not use regular expressions:

import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))

This script does what you are looking for, but it also resolves the relative links to absolute links.

import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    # pass the list itself: a generator would be partly consumed by guess_root()
    return resolve_links(get_dom(url).xpath('//a/@href'))

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link  

for link in get_links('http://www.google.com'):
    print link

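Links can be within a variety of attributes, so you can pass a list of those attributes to select.

For example, with the src and href attributes. Here I use the "starts with" operator (^) to specify that the value of either attribute starts with http. You can tailor this as required.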

from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)

Attribute = value selectors

[attr^=value]

Represents elements with an attribute named attr whose value is prefixed by (begins with) value.
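As a small offline illustration of the prefix selector, the sketch below runs the same `select()` call against an inline HTML string. The sample markup is an assumption for illustration; the selector string is the one from the answer above.

```python
from bs4 import BeautifulSoup

# Inline sample markup (an assumption for illustration).
SAMPLE_HTML = """
<a href="https://example.com/page">absolute link</a>
<a href="/local/page">relative link</a>
<img src="http://example.com/logo.png">
<script src="static/app.js"></script>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match any element whose href or src value begins with "http".
matches = soup.select('[href^="http"], [src^="http"]')
urls = [tag.get("href") or tag.get("src") for tag in matches]
print(urls)
```

The relative `href` and the relative `src` are left out because their values do not begin with `http`.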

Here is an example that uses @ars's accepted answer together with the BeautifulSoup4, requests, and wget modules to handle the downloads.

import requests
import wget
import os

from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)

I found that @Blairg23's answer works after the following correction (it covers the scenario where the original failed to work correctly):

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path =urlparse.urljoin(url , link['href']) #module urlparse need to be imported
            wget.download(full_path)

For Python 3:

urllib.parse.urljoin has to be used instead to obtain the full URL.
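To show why urljoin is safer than concatenating strings, here is a minimal Python 3 sketch. The base URL and file names are placeholders chosen for illustration, not the dataset URL from the answer.

```python
from urllib.parse import urljoin

# Placeholder directory URL (an assumption for illustration).
base = "https://example.com/datasets/eeg/"

relative = urljoin(base, "co2a0000364.tar.gz")
root_relative = urljoin(base, "/other/file.tar.gz")
already_absolute = urljoin(base, "https://mirror.example.org/file.tar.gz")
parent = urljoin(base, "../readme.txt")

# Plain concatenation only works for the first case.
naive = base + "/other/file.tar.gz"

for url in (relative, root_relative, already_absolute, parent, naive):
    print(url)
```

Concatenation produces a correct URL only when the href is a bare file name; root-relative and already-absolute hrefs come out malformed.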

BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which can parse directly from a URL (with some limitations mentioned below).

import lxml.html

doc = lxml.html.parse(url)

links = doc.xpath('//a[@href]')

for link in links:
    print link.attrib['href']

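The code above returns the links as is, and in most cases they will be relative links or links that are absolute from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern such as *.mp3. It does not handle single and double dots in relative paths, but so far I have not needed that. If you need to parse URL fragments containing ../ or ./, then urlparse.urljoin might come in handy.

NOTE: Direct lxml URL parsing does not handle loading from https and does not follow redirects, so for this reason the version below uses urllib2 + lxml.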

#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None


def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)


if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')

    for link in links:
        href = link.attrib['href']

        if fnmatch.fnmatch(href, glob_patt):

            # fixed: a missing comma had merged 'https://' and 'ftp://' into one string
            if not href.startswith(('http://', 'https://', 'ftp://')):

                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)

                    if urltools:
                        href = urltools.normalize(href)

            print href

The usage is as follows:

getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
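For hrefs that contain ./ or ../ segments, a hedged Python 3 variant is to resolve every href with `urllib.parse.urljoin` first and apply the glob afterwards. Note that this differs from the script above, which matches the glob against the raw href before resolving it. The function name `matching_urls` and the sample hrefs are assumptions for illustration.

```python
import fnmatch
from urllib.parse import urljoin


def matching_urls(page_url, hrefs, pattern="*"):
    """Resolve each href against page_url, then keep the URLs matching the glob."""
    resolved = (urljoin(page_url, href) for href in hrefs)
    return [url for url in resolved if fnmatch.fnmatch(url, pattern)]


# Placeholder page URL and hrefs (assumptions for illustration).
page = "http://fakedomain.mu/music/somepage.html"
hrefs = ["track1.mp3", "../archive/track2.mp3", "./notes.txt", "/top.mp3"]

mp3_urls = matching_urls(page, hrefs, "*.mp3")
print(mp3_urls)
```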

import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']

License: CC-BY-SA with attribution

Not affiliated with StackOverflow