Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

https://stackoverflow.com/questions/804336

03-07-2019

Question

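I'm wondering what the best way is (or whether there's a simple way using the standard library) to convert a URL with Unicode chars in the domain name and path to the equivalent ASCII URL, with the domain encoded as IDNA and the path %-encoded, as per RFC 3986.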

I get the URL from the user in UTF-8. So if they've typed in http://➡.ws/♥ I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. What I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
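For reference, the two target encodings described above can be reproduced in isolation. This is only a minimal sketch and it assumes Python 3 (where strings are already Unicode and the quoting function lives in urllib.parse), whereas the code in this thread is Python 2. The variable names are illustrative.

```python
from urllib.parse import quote

host = '\u27a1.ws'   # the domain "➡.ws"
path = '/\u2665'     # the path "/♥"

# IDNA-encode the host with Python's built-in codec (yields ASCII bytes)
ascii_host = host.encode('idna').decode('ascii')

# Percent-encode the path as UTF-8; '/' is left alone by default
ascii_path = quote(path)

ascii_url = 'http://' + ascii_host + ascii_path
print(ascii_url)
```

The host and the path need different treatment, which is why the URL has to be split into components before encoding.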

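What I do at the moment is split the URL into parts via a regex, then manually IDNA-encode the domain, and separately encode the path and the query string with different urllib.quote() calls.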

# url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''

path = urllib.quote(path)

if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'

Is this correct? Any better suggestions? Is there a simple standard-library function to do this?

Solution

Code:

import urlparse, urllib

def fixurl(url):
    # turn string into unicode
    if not isinstance(url,unicode):
        url = url.decode('utf8')

    # parse it
    parsed = urlparse.urlsplit(url)

    # divide the netloc further
    userpass,at,hostport = parsed.netloc.rpartition('@')
    user,colon1,pass_ = userpass.partition(':')
    host,colon2,port = hostport.partition(':')

    # encode each component
    scheme = parsed.scheme.encode('utf8')
    user = urllib.quote(user.encode('utf8'))
    colon1 = colon1.encode('utf8')
    pass_ = urllib.quote(pass_.encode('utf8'))
    at = at.encode('utf8')
    host = host.encode('idna')
    colon2 = colon2.encode('utf8')
    port = port.encode('utf8')
    path = '/'.join(  # could be encoded slashes!
        urllib.quote(urllib.unquote(pce).encode('utf8'),'')
        for pce in parsed.path.split('/')
    )
    query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
    fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

    # put it back together
    netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
    return urlparse.urlunsplit((scheme,netloc,path,query,fragment))

print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
print fixurl(u'http://➡.ws/admin')

Output:

http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin

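Edit:

Fixed the case of already-quoted characters in the string.
Changed urlparse/urlunparse to urlsplit/urlunsplit.
Don't encode user and port information together with the hostname. (Thanks Jehiah)
When "@" is missing, don't treat the host/port as user/password! (Thanks hupf)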
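The code above targets Python 2 (urlparse, urllib.quote, unicode). As a rough sketch only, the same steps might look like this on Python 3, assuming urllib.parse as the replacement module. The name fixurl_py3 is hypothetical and not from the answer.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def fixurl_py3(url):
    """Hypothetical Python 3 port of fixurl(); expects a str, not bytes."""
    parsed = urlsplit(url)

    # divide the netloc further, as the original does
    userpass, at, hostport = parsed.netloc.rpartition('@')
    user, colon1, pass_ = userpass.partition(':')
    host, colon2, port = hostport.partition(':')

    # IDNA-encode only the host; percent-encode user and password
    host = host.encode('idna').decode('ascii')
    netloc = ''.join((quote(user), colon1, quote(pass_), at, host, colon2, port))

    # unquote first so already-escaped input is not escaped twice;
    # quote each path segment with no safe chars so '%2F' survives
    path = '/'.join(quote(unquote(pce), safe='')
                    for pce in parsed.path.split('/'))
    query = quote(unquote(parsed.query), safe='=&?/')
    fragment = quote(unquote(parsed.fragment))

    return urlunsplit((parsed.scheme, netloc, path, query, fragment))

print(fixurl_py3('http://\u27a1.ws/\u2665/%2F'))
```

One difference to be aware of: with its default arguments, Python 3's unquote() decodes invalid UTF-8 escapes such as %FF into replacement characters, so this sketch does not round-trip them unchanged.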

Other tips

The code given by MizardX isn't 100% correct. This example won't work:

example.com/folder/?page=2
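The answer does not say why this input fails. One thing worth noting, offered here as an assumption rather than the poster's explanation, is that the example has no scheme, so the standard URL splitter treats the host as part of the path and leaves the network location empty. The sketch below uses Python 3's urllib.parse (the thread's code uses the Python 2 urlparse module) to show the difference.

```python
from urllib.parse import urlsplit

# Without a scheme, nothing is recognised as the host
bare = urlsplit('example.com/folder/?page=2')
print(repr(bare.netloc), repr(bare.path), repr(bare.query))

# With a scheme, the host lands in netloc as expected
with_scheme = urlsplit('http://example.com/folder/?page=2')
print(repr(with_scheme.netloc), repr(with_scheme.path), repr(with_scheme.query))
```

Any code that IDNA-encodes only the netloc would therefore never see the domain of a scheme-less URL.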

Check out django.utils.encoding.iri_to_uri() to convert a Unicode URL to an ASCII URL.

http://docs.djangoproject.com/en/dev/ref/unicode/

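There is some RFC 3986 URL parsing work underway (e.g. as part of the Summer of Code), but nothing in the standard library yet AFAIK, and not much on the URI encoding side of things either. So you might as well go with MizardX's elegant approach.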

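Okay, with these comments and some bug fixes in my own code (it didn't handle fragments at all), I've come up with the following canonurl() function, which returns a canonical ASCII form of the URL: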

import re
import urllib
import urlparse

def canonurl(url):
    r"""Return the canonical, ASCII-encoded form of a UTF-8 encoded URL, or ''
    if the URL looks invalid.

    >>> canonurl(' ')
    ''
    >>> canonurl('www.google.com')
    'http://www.google.com/'
    >>> canonurl('bad-utf8.com/path\xff/file')
    ''
    >>> canonurl('svn://blah.com/path/file')
    'svn://blah.com/path/file'
    >>> canonurl('1234://badscheme.com')
    ''
    >>> canonurl('bad$scheme://google.com')
    ''
    >>> canonurl('site.badtopleveldomain')
    ''
    >>> canonurl('site.com:badport')
    ''
    >>> canonurl('http://123.24.8.240/blah')
    'http://123.24.8.240/blah'
    >>> canonurl('http://123.24.8.240:1234/blah?q#f')
    'http://123.24.8.240:1234/blah?q#f'
    >>> canonurl('\xe2\x9e\xa1.ws')  # tinyarro.ws
    'http://xn--hgi.ws/'
    >>> canonurl(' http://www.google.com:80/path/file;params?query#fragment ')
    'http://www.google.com:80/path/file;params?query#fragment'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
    'http://xn--hgi.ws/%E2%99%A5'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth')
    'http://xn--hgi.ws/%E2%99%A5/pa/th'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth;par%2Fams?que%2Fry=a&b=c')
    'http://xn--hgi.ws/%E2%99%A5/pa/th;par/ams?que/ry=a&b=c'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5?\xe2\x99\xa5#\xe2\x99\xa5')
    'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
    >>> canonurl('http://\xe2\x9e\xa1.ws/%e2%99%a5?%E2%99%A5#%E2%99%A5')
    'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
    >>> canonurl('http://badutf8pcokay.com/%FF?%FE#%FF')
    'http://badutf8pcokay.com/%FF?%FE#%FF'
    >>> len(canonurl('google.com/' + 'a' * 16384))
    4096
    """
    # strip spaces at the ends and ensure it's prefixed with 'scheme://'
    url = url.strip()
    if not url:
        return ''
    if not urlparse.urlsplit(url).scheme:
        url = 'http://' + url

    # turn it into Unicode
    try:
        url = unicode(url, 'utf-8')
    except UnicodeDecodeError:
        return ''  # bad UTF-8 chars in URL

    # parse the URL into its components
    parsed = urlparse.urlsplit(url)
    scheme, netloc, path, query, fragment = parsed

    # ensure scheme is a letter followed by letters, digits, and '+-.' chars
    if not re.match(r'[a-z][-+.a-z0-9]*$', scheme, flags=re.I):
        return ''
    scheme = str(scheme)

    # ensure domain and port are valid, eg: sub.domain.<1-to-6-TLD-chars>[:port]
    match = re.match(r'(.+\.[a-z0-9]{1,6})(:\d{1,5})?$', netloc, flags=re.I)
    if not match:
        return ''
    domain, port = match.groups()
    netloc = domain + (port if port else '')
    netloc = netloc.encode('idna')

    # ensure path is valid and convert Unicode chars to %-encoded
    if not path:
        path = '/'  # eg: 'http://google.com' -> 'http://google.com/'
    path = urllib.quote(urllib.unquote(path.encode('utf-8')), safe='/;')

    # ensure query is valid
    query = urllib.quote(urllib.unquote(query.encode('utf-8')), safe='=&?/')

    # ensure fragment is valid
    fragment = urllib.quote(urllib.unquote(fragment.encode('utf-8')))

    # piece it all back together, truncating it to a maximum of 4KB
    url = urlparse.urlunsplit((scheme, netloc, path, query, fragment))
    return url[:4096]

if __name__ == '__main__':
    import doctest
    doctest.testmod()

You could use urlparse.urlsplit instead, but otherwise you seem to have a very straightforward solution.

protocol, domain, path, query, fragment = urlparse.urlsplit(url)

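(You can access the domain and port separately through the named attributes of the returned value, but since port syntax is always ASCII, it is unaffected by the IDNA encoding process.)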
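As an illustration of those named attributes, here is a small sketch. It assumes Python 3's urllib.parse; in Python 2 the same attributes are available on the result of urlparse.urlsplit. The variable names are illustrative.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://\u27a1.ws:81/admin')

host = parts.hostname   # host only, without the port
port = parts.port       # parsed as an int, or None if absent

# only the host needs IDNA encoding; the port is already ASCII digits
ascii_host = host.encode('idna').decode('ascii')
netloc = ascii_host if port is None else '%s:%d' % (ascii_host, port)
print(netloc)
```

Note that hostname is lowercased by the parser and that credentials (user:password@) are exposed separately as username and password, so they would need to be re-attached if present.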

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow