How to extract the top-level domain name (TLD) from a URL

https://stackoverflow.com/questions/1066933

21-08-2019

Question

How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

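This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (top-level domains) or country codes (because they change)?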
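To make the failure concrete, here is a minimal sketch of the same "last two labels" idea in Python 3 (the one-liner above uses the Python 2 `urlparse` module). The helper name `naive_domain` is made up for illustration.

```python
from urllib.parse import urlparse

def naive_domain(url):
    """Keep only the last two dot-separated labels of the host (hypothetical helper)."""
    return ".".join(urlparse(url).netloc.split(".")[-2:])

print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au, not the wanted foo.com.au
```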

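Thanks

Solution

No, there is no "intrinsic" way of knowing that (for example) zap.co.it is a subdomain (because Italy's registrar does sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar doesn't sell domains such as co.uk, only ones like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly, like the UK's and Australia's. There is no way of divining that from just staring at the string without such extra semantic knowledge. Of course the rules can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!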
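As a rough sketch of the auxiliary-table idea, the snippet below checks the host against a small hand-written set of multi-label suffixes and otherwise falls back to the last label. The table contents, the name `MULTI_LABEL_SUFFIXES`, and the helper `domain_with_table` are all assumptions made for illustration. A real table would be far larger and would need to be kept up to date. The URL is assumed to include a scheme.

```python
from urllib.parse import urlparse

# Toy table for illustration only; it follows the examples in the answer above.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au"}

def domain_with_table(url, suffixes=MULTI_LABEL_SUFFIXES):
    """Return the suffix plus one extra label, using the table to find the suffix."""
    labels = urlparse(url).hostname.split(".")
    # Try the longest candidate suffix first, down to a single label.
    for size in range(len(labels) - 1, 0, -1):
        suffix = ".".join(labels[-size:])
        # A single label is always accepted as the suffix of last resort.
        if suffix in suffixes or size == 1:
            return ".".join(labels[-(size + 1):])
    return ".".join(labels)  # single-label host such as "localhost"

print(domain_with_table("http://www.foo.com.au"))  # foo.com.au
print(domain_with_table("http://zap.co.uk"))       # zap.co.uk
print(domain_with_table("http://zap.co.it"))       # co.it (with this toy table)
```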

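Other tips

Here's a great Python module that someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, which is maintained by Mozilla volunteers.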

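Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

Using the file of effective TLDs that someone else found on Mozilla's website: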

from urllib.parse import urlparse

# Load the TLD rules into a set, ignoring comments and empty lines.
with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
    tlds = {line.strip() for line in tld_file if line[0] not in "/\n"}

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if exception_candidate in tlds:
            # an exception rule marks the candidate itself as the domain
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            # the candidate is a suffix, so keep one more label to its left
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

Result:

abcde.co.uk

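I would appreciate it if someone let me know which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know whether ValueError is the best thing to raise. Comments?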
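One possible tidier shape for the loop above is sketched below. It walks the candidates with a forward index instead of negative slices and takes the rules as a set argument, so it can be tried without the data file. The name `registrable_domain` and the tiny `SAMPLE_RULES` set are assumptions for illustration. The sketch is a simplification of the Public Suffix List matching rules, not a full implementation, and unlike the code above it raises when the host is itself a listed suffix.

```python
from urllib.parse import urlparse

# A few rules in Public Suffix List syntax, hand-picked for the demo.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}

def registrable_domain(url, rules):
    """Return the matched suffix plus one label, or raise ValueError."""
    labels = urlparse(url).hostname.split(".")
    # Candidates from longest to shortest: a.b.c, then b.c, then c.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        wildcard = ".".join(["*"] + labels[start + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is the registrable domain.
            return candidate
        if candidate in rules or wildcard in rules:
            if start == 0:
                raise ValueError("Host is itself a public suffix: " + candidate)
            # Keep one more label to the left of the matched suffix.
            return ".".join(labels[start - 1:])
    raise ValueError("Domain not in the given list of TLD rules")

print(registrable_domain("http://abcde.co.uk", SAMPLE_RULES))  # abcde.co.uk
print(registrable_domain("http://www.foo.com", SAMPLE_RULES))  # foo.com
print(registrable_domain("http://a.www.ck", SAMPLE_RULES))     # www.ck
```

Whether ValueError is the right exception is a judgment call. It is used here because the input value is well-formed but not acceptable.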

Using the Python tld package

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

Or without a protocol

from tld import get_tld
get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list:

http://www.iana.org/domains/root/db/

Here's how I handle it:

import re
import sys
from urllib.parse import urlparse

# url is assumed to be defined earlier
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url)[1]
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match:
    sys.exit(2)
elif not match.group(0):
    sys.exit(2)

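Until get_tld is updated for all the new ones, I pull the TLD from the error. Sure, it's bad code, but it works.

# Presumably a method of a class that has a content_url attribute; the
# get_tld called inside is then the module-level function from the tld package.
def get_tld(self):
    try:
        return get_tld(self.content_url)
    except Exception as e:
        re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
        matchObj = re_domain.findall(str(e))
        if matchObj:
            for m in matchObj:
                return m
        raise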

In Python I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'!!

So finally, I decided to write this method.

IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It isn't meant to replace more advanced libraries like tldextract.

def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
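Note that the method above splits the raw string, so a scheme or path ends up inside the result: "http://www.mybrand.sa.com/path" would give a subdomain of "http://www" and a suffix of "sa.com/path". A possible variant, sketched here under the hypothetical name `urlextract_from_url`, isolates the host with `urllib.parse` first. It keeps the same strong assumption that the host has exactly one subdomain label.

```python
from urllib.parse import urlparse

def urlextract_from_url(url):
    """Split the host of url into subdomain, domain and suffix (hypothetical helper)."""
    # Without a scheme, urlparse treats the host as a path, so prefix "//".
    host = urlparse(url if "://" in url else "//" + url).hostname
    parts = host.split(".")
    if len(parts) <= 2:
        raise ValueError("Full url required with subdomain: " + url)
    # Assumes exactly one subdomain label; everything after the second label is the suffix.
    return {"subdomain": parts[0], "domain": parts[1], "suffix": ".".join(parts[2:])}

print(urlextract_from_url("https://www.mybrand.sa.com:8080/order?id=1"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}
```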

Licensed under CC-BY-SA with attribution

Not affiliated with StackOverflow