Get the root domain of a link

https://stackoverflow.com/questions/1521592

19-09-2019

Question

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in Python?

Solution

Getting the hostname is easy enough using urlparse:

from urllib.parse import urlparse  # Python 3 (Python 2: the urlparse module)
hostname = urlparse("http://www.techcrunch.com/").hostname
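As a small illustration of why `hostname` is usually the better attribute here than `netloc`: `hostname` drops any user info and port and lowercases the result, while `netloc` is the raw authority part. The URL below is made up for the example, and the snippet assumes Python 3.

```python
from urllib.parse import urlparse

# Hypothetical URL with user info, mixed case, and a port.
parsed = urlparse("http://user@WWW.TechCrunch.com:8080/some/path")

netloc = parsed.netloc      # raw authority part, unchanged
hostname = parsed.hostname  # host only, lowercased, no port or user info

print(netloc)
print(hostname)
```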

Getting the "root domain", however, is more problematic, because it isn't defined in a syntactic sense. What is the root domain of "www.theregister.co.uk"? What about networks that use default domains? "devbox12" could be a valid hostname.
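To make the problem concrete, here is a minimal sketch of the purely syntactic approach (keep the last two labels). The helper name `naive_root` is made up for this illustration.

```python
def naive_root(hostname):
    """Keep only the last two dot-separated labels (a naive guess)."""
    return ".".join(hostname.split(".")[-2:])

print(naive_root("www.techcrunch.com"))     # techcrunch.com
print(naive_root("www.theregister.co.uk"))  # co.uk (a suffix, not a site)
print(naive_root("devbox12"))               # devbox12
```

The first result is what the question asks for, but the second shows that no fixed label count works for every hostname.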

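One way to handle this is to use the Public Suffix List (PSL), which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") and private domains that are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library: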

import publicsuffix2
from urllib.parse import urlparse  # Python 3

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself.  Make sure you
    # update frequently, though!
    psl = publicsuffix2.fetch()

    hostname = urlparse(url).hostname

    return publicsuffix2.get_public_suffix(hostname, psl)
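For readers who want to see roughly what a suffix-list lookup does, here is a minimal standard-library sketch. It is not how publicsuffix2 is implemented: `SAMPLE_SUFFIXES` is a tiny hand-picked set and `base_domain_from_sample` is a hypothetical helper, and the real list's wildcard and exception rules are ignored.

```python
from urllib.parse import urlparse

# Tiny hand-picked sample of suffixes, for illustration only.
# The real Public Suffix List has thousands of entries plus wildcard
# and exception rules, which this sketch ignores.
SAMPLE_SUFFIXES = {"com", "net", "org", "uk", "co.uk", "io", "github.io"}

def base_domain_from_sample(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one more label, or None."""
    hostname = urlparse(url).hostname
    if not hostname or hostname in suffixes:
        return None  # no host, or the host is itself a bare suffix
    labels = hostname.split(".")
    # Dropping labels from the left tries the longest candidate suffix first.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched (e.g. "devbox12")

print(base_domain_from_sample("http://www.techcrunch.com/"))
print(base_domain_from_sample("http://www.theregister.co.uk/"))
print(base_domain_from_sample("http://devbox12/"))
```

Matching the longest suffix first is what lets "co.uk" win over "uk", so the extra label kept is "theregister" rather than "co".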

Other tips

General structure of a URL:

scheme://netloc/path;parameters?query#fragment
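As a quick check of that structure, the sketch below parses a made-up URL (example.com and its path are placeholders) with Python 3's `urllib.parse.urlparse` and pulls out the six components.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises all six components.
parts = urlparse("http://www.example.com/a/b;v=1?q=python#top")

print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /a/b
print(parts.params)    # v=1
print(parts.query)     # q=python
print(parts.fragment)  # top
```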

As the TIMTOWTDI motto goes (there is more than one way to do it):

Using urlparse,

>>> from urllib.parse import urlparse  # python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '')  # as per your case
>>> print(result)
stackoverflow.com/

Using tldextract,

>>> import tldextract  # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

In your case:

>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'

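tldextract, on the other hand, knows what all gTLDs [generic top-level domains] and ccTLDs [country code top-level domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it can tell the subdomain from the domain, and the domain from the country code.

Cheers! :)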

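The following script is not perfect, but it can be used for display or shortening purposes. If you want to avoid third-party dependencies, especially ones that fetch and cache TLD data remotely, I can suggest the following script, which I use in my projects. It keeps the last two parts of the domain for the most common domain extensions and the last three parts for the remaining, less well-known extensions. In the worst case, the domain will have three parts instead of two.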

from urllib.parse import urlparse  # Python 3 (Python 2: from urlparse import urlparse)

def extract_domain(url):
    parsed_domain = urlparse(url)
    domain = parsed_domain.netloc or parsed_domain.path # Just in case, for urls without scheme
    domain_parts = domain.split('.')
    if len(domain_parts) > 2:
        return '.'.join(domain_parts[-(2 if domain_parts[-1] in {
            'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'} else 3):])
    return domain

extract_domain('google.com')          # google.com
extract_domain('www.google.com')      # google.com
extract_domain('sub.sub2.google.com') # google.com
extract_domain('google.co.uk')        # google.co.uk
extract_domain('sub.google.co.uk')    # google.co.uk
extract_domain('www.google.com')      # google.com
extract_domain('sub.sub2.voila.fr')   # sub2.voila.fr

Using Python 3.3 rather than 2.x

I would like to add a small thing to Ben Blank's answer.

from urllib.parse import unquote, urlparse

u = "http://twitter.co.uk/hello/there"  # u is the URL
u = unquote(u)   # decode any percent-escapes
g = urlparse(u)
u = g.netloc     # e.g. twitter.co.uk

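So far, I have just got the domain name from urlparse.

To remove the subdomains, you first need to know which parts are top-level domains and which are not. In the example above, http://twitter.co.uk, co.uk is a TLD, while in http://sub.twitter.com only .com is the TLD and sub is a subdomain.

So we need a file or list that contains all the TLDs.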

tlds = load_file("tlds.txt") #tlds holds the list of tlds

hostname = u.split(".")
if len(hostname)>2:
    if hostname[-2].upper() in tlds:
        hostname=".".join(hostname[-3:])
    else:
        hostname=".".join(hostname[-2:])
else:
    hostname=".".join(hostname[-2:])
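The snippet above relies on an undefined `load_file` helper and a `tlds.txt` file. A runnable sketch of the same idea, under the assumption that the list holds upper-case second-level labels (such as the "co" in "co.uk"), might look like this; `SECOND_LEVEL_LABELS` and `strip_subdomains` are hypothetical names standing in for the file contents and the logic.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the contents of "tlds.txt": labels that act as
# second-level suffixes under a country code (e.g. the "co" in "co.uk").
SECOND_LEVEL_LABELS = {"CO", "COM", "ORG", "NET", "AC", "GOV", "EDU"}

def strip_subdomains(url, second_level=SECOND_LEVEL_LABELS):
    """Keep three labels if the second-to-last one is listed, else two."""
    labels = urlparse(url).netloc.split(".")
    if len(labels) > 2 and labels[-2].upper() in second_level:
        keep = 3
    else:
        keep = 2
    return ".".join(labels[-keep:])

print(strip_subdomains("http://twitter.co.uk/hello/there"))  # twitter.co.uk
print(strip_subdomains("http://sub.twitter.com"))            # twitter.com
```

Note that a check on a single label can misfire: a host such as www.net.com would keep three labels here, which is one reason the Public Suffix List approach is more reliable.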

from urllib.parse import urlsplit  # Python 3

def get_domain(url):
    u = urlsplit(url)
    return u.netloc

def get_top_domain(url):
    u"""
    >>> get_top_domain('http://www.google.com')
    'google.com'
    >>> get_top_domain('http://www.sina.com.cn')
    'sina.com.cn'
    >>> get_top_domain('http://bbc.co.uk')
    'bbc.co.uk'
    >>> get_top_domain('http://mail.cs.buaa.edu.cn')
    'buaa.edu.cn'
    """
    domain = get_domain(url)
    domain_parts = domain.split('.')
    if len(domain_parts) < 2:
        return domain
    top_domain_parts = 2
    # If a domain's last part is two letters long, treat it as a country code
    if len(domain_parts[-1]) == 2:
        if domain_parts[-1] in ['uk', 'jp']:
            if domain_parts[-2] in ['co', 'ac', 'me', 'gov', 'org', 'net']:
                top_domain_parts = 3
        else:
            if domain_parts[-2] in ['com', 'org', 'net', 'edu', 'gov']:
                top_domain_parts = 3
    return '.'.join(domain_parts[-top_domain_parts:])

This worked for my purposes. I figured I'd share it.

".".join("www.sun.google.com".split(".")[-2:])

License: CC-BY-SA with attribution

Not affiliated with StackOverflow