How to extract the top-level domain name (TLD) from a URL

https://stackoverflow.com/questions/1066933

21-08-2019

Question

How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (top-level domains) or country codes (because they change)?
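To make the failure concrete, here is a minimal sketch of the one-liner above ported to Python 3's urllib.parse. The wrapper name naive_domain is only for illustration and is not part of the original attempt.

```python
from urllib.parse import urlparse

def naive_domain(url):
    """Keep only the last two dot-separated labels of the host name."""
    return ".".join(urlparse(url).netloc.split(".")[-2:])

print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au, not the desired foo.com.au
```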

Thanks

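Solution

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only ones like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's. There's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).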
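As a rough sketch of the auxiliary-table idea, the snippet below consults a small hand-picked set of two-label suffixes before deciding how many labels to keep. The table contents and the names MULTI_LABEL_SUFFIXES and registered_domain are assumptions made for illustration; a real table would be much longer and would need regular updates.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked table of two-label suffixes (illustration only).
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "net.au"}

def registered_domain(url):
    """Return the suffix plus the one label that precedes it."""
    labels = urlparse(url).netloc.lower().split(".")
    # Treat the last two labels as the suffix only if the table says so.
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])

print(registered_domain("http://www.foo.com"))     # foo.com
print(registered_domain("http://www.foo.com.au"))  # foo.com.au
print(registered_domain("http://zap.co.uk"))       # zap.co.uk
```

With this table, zap.co.it would come out as co.it, which matches the description above of zap being a subdomain there. The string handling is trivial; all the knowledge lives in the table.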

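Other tips

Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, which is maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [generic top-level domains] and ccTLDs [country code top-level domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

Using this file of effective TLDs, which someone else found on Mozilla's website: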

from urllib.parse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
    tlds = {line.strip() for line in tld_file
            if line.strip() and not line.startswith("//")}

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if exception_candidate in tlds:
            # an exception rule ("!...") marks the candidate itself as the domain
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            # keep one more label in front of the matched suffix
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

Result:

abcde.co.uk

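I'd appreciate it if someone could tell me which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know whether ValueError is the best thing to raise. Comments?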
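One possible restructuring, offered as a sketch rather than as the answerer's code: iterate over a forward start index instead of negative offsets, and read the host through urlparse(...).hostname, which drops any port and lowercases the name. SAMPLE_RULES is a tiny hand-made rule set standing in for the real file, and the extra ValueError cases (no host, host that is itself a suffix) are assumptions about what a caller would want.

```python
from urllib.parse import urlparse

# Tiny hand-made rule set for illustration; the real list is far longer.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}

def get_domain(url, rules):
    """Return the registrable domain of `url` according to `rules`."""
    host = urlparse(url).hostname  # None when the URL has no scheme
    if not host:
        raise ValueError("No host name found in %r" % url)
    labels = host.split(".")

    # Try the longest candidate suffix first, then drop one leading label at a time.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        wildcard = ".".join(["*"] + labels[start + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            if start == 0:
                raise ValueError("%r is itself a public suffix" % host)
            # Registrable domain = matched suffix plus one more label.
            return ".".join(labels[start - 1:])

    raise ValueError("No suffix rule matches %r" % host)

print(get_domain("http://www.abcde.co.uk/path", SAMPLE_RULES))  # abcde.co.uk
print(get_domain("http://www.ck", SAMPLE_RULES))                # www.ck
```

ValueError seems a reasonable choice here, since the argument has the right type but an unusable value.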

Using Python tld

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

or without the protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'

There are many, many TLDs. Here's the list:

http://data.iana.org/tld/tlds-alpha-by-domain.txt

Here's another list

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list

http://www.iana.org/domains/root/db/
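A short sketch of how the first of these lists could be used, assuming its plain-text layout of one upper-case TLD per line with "#" comment lines. SAMPLE_TEXT is made-up stand-in data, not a copy of the real file, and the helper names load_tlds and has_known_tld are hypothetical.

```python
# Made-up stand-in for the downloaded file (illustration only).
SAMPLE_TEXT = """\
# illustrative comment line
AU
COM
UK
"""

def load_tlds(text):
    """Parse the list into a set of lower-case TLDs, skipping comments and blanks."""
    return {line.strip().lower() for line in text.splitlines()
            if line.strip() and not line.startswith("#")}

def has_known_tld(hostname, tlds):
    """Check only the final label of the host name against the list."""
    return hostname.rsplit(".", 1)[-1].lower() in tlds

tlds = load_tlds(SAMPLE_TEXT)
print(has_known_tld("www.foo.com.au", tlds))  # True
print(has_known_tld("www.foo.bogus", tlds))   # False
```

A list of this kind contains only single-label TLDs, so it can confirm that "au" exists but cannot say that "com.au" acts as a suffix.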

Here's how I handle it:

import re
import sys
from urllib.parse import urlparse

# `url` is assumed to already hold the input string
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url)[1]
# keep only the last two labels of the host name
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match or not match.group(0):
    sys.exit(2)

Until get_tld is updated for all the new ones, I pull the TLD from the error. Sure, it's bad code, but it works.

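import re
import tld

# Written as a method: self.content_url is assumed to be an attribute of the enclosing class.
def get_tld(self):
    try:
        return tld.get_tld(self.content_url)
    except Exception as e:
        # Pull the domain out of the library's error message.
        re_domain = re.compile(r"Domain ([^ ]+) didn't match any existing TLD name!")
        match = re_domain.search(str(e))
        if match:
            return match.group(1)
        raise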

In Python I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'!!

So finally, I decided to write this method.

IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It isn't meant to replace more advanced libraries like tldextract.

def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
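A short usage sketch may help show where this approach holds and where it stops. The function is repeated so the snippet runs on its own, and the input a.b.example.com is a made-up host chosen to show the single-label-subdomain assumption.

```python
def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

print(urlextract("www.mybrand.sa.com"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}

# With a two-label subdomain the split is purely positional:
print(urlextract("a.b.example.com"))
# {'subdomain': 'a', 'domain': 'b', 'suffix': 'example.com'}
```

The input is expected to be a bare host name; a scheme such as http:// would end up inside the 'subdomain' value because the string is split only on dots.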

License: CC-BY-SA with attribution

Not affiliated with StackOverflow