What is the best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

https://stackoverflow.com/questions/804336

03-07-2019

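Question

I'm wondering what the best way is, or whether there is a simple way using the standard library, to convert a URL with Unicode characters in the domain name and path into the equivalent ASCII URL, with the domain encoded as IDNA and the path %-encoded as per RFC 3986.

I get a URL in UTF-8 from the user. So if they've typed in http://➡.ws/♥ I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. What I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.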
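The two target encodings can be tried in isolation. The following is only a minimal sketch, and it assumes Python 3 (where `quote` lives in `urllib.parse`), not the Python 2 used in the question; the variable names are illustrative.

```python
from urllib.parse import quote

# Host part: the built-in "idna" codec yields the Punycode ("xn--") form.
# "\u27a1" is the arrow character used in the question's example domain.
host = "\u27a1.ws".encode("idna").decode("ascii")

# Path part: quote() encodes the text as UTF-8 and %-escapes the bytes.
# "\u2665" is the heart character used in the question's example path.
path = quote("/\u2665")

print("http://" + host + path)  # http://xn--hgi.ws/%E2%99%A5
```

The hard part, which the rest of this page deals with, is splitting a full URL reliably so that each piece gets the right encoding.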

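What I do at the moment is split the URL into parts with a regex, manually IDNA-encode the domain, and then encode the path and the query string separately with different urllib.quote() calls.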



# url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''

path = urllib.quote(path)

if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'


Is this correct? Any better suggestions? Is there a simple standard-library function to do this?


	
		
					
				
					
						

	
		
			
				




				




			
			
Solution

Code:

import urlparse, urllib

def fixurl(url):
    # turn string into unicode
    if not isinstance(url,unicode):
        url = url.decode('utf8')

    # parse it
    parsed = urlparse.urlsplit(url)

    # divide the netloc further
    userpass,at,hostport = parsed.netloc.rpartition('@')
    user,colon1,pass_ = userpass.partition(':')
    host,colon2,port = hostport.partition(':')

    # encode each component
    scheme = parsed.scheme.encode('utf8')
    user = urllib.quote(user.encode('utf8'))
    colon1 = colon1.encode('utf8')
    pass_ = urllib.quote(pass_.encode('utf8'))
    at = at.encode('utf8')
    host = host.encode('idna')
    colon2 = colon2.encode('utf8')
    port = port.encode('utf8')
    path = '/'.join(  # could be encoded slashes!
        urllib.quote(urllib.unquote(pce).encode('utf8'),'')
        for pce in parsed.path.split('/')
    )
    query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
    fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

    # put it back together
    netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
    return urlparse.urlunsplit((scheme,netloc,path,query,fragment))

print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
print fixurl(u'http://➡.ws/admin')


Output:

http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin


Read more:

urllib.quote()
urlparse.urlparse()
urlparse.urlunparse()
urlparse.urlsplit()
urlparse.urlunsplit()


Edit:

Fixed the case of already quoted characters in the string.
Changed urlparse/urlunparse to urlsplit/urlunsplit.
Don't encode user and port information with the hostname. (Thanks Jehiah)
When "@" is missing, don't treat the host/port as user/pass! (Thanks hupf)
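The code above is Python 2 (`urlparse`, `unicode`, `print` statement). For readers on Python 3, a rough port might look like the sketch below. It is an assumption-laden adaptation, not the answer author's code: the name `fixurl3` is hypothetical, the input is assumed to already be a `str`, and the functions come from `urllib.parse`.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def fixurl3(url):
    """Hypothetical Python 3 port of fixurl(); `url` must be a str."""
    parsed = urlsplit(url)

    # divide the netloc into user info and host/port, as the original does
    userpass, at, hostport = parsed.netloc.rpartition('@')
    user, colon1, pass_ = userpass.partition(':')
    host, colon2, port = hostport.partition(':')

    # IDNA-encode the host only; %-encode the other parts as UTF-8
    host = host.encode('idna').decode('ascii')
    user = quote(user)
    pass_ = quote(pass_)
    path = '/'.join(  # unquote first so existing escapes are not doubled
        quote(unquote(pce), safe='') for pce in parsed.path.split('/')
    )
    query = quote(unquote(parsed.query), safe='=&?/')
    fragment = quote(unquote(parsed.fragment))

    # put it back together
    netloc = ''.join((user, colon1, pass_, at, host, colon2, port))
    return urlunsplit((parsed.scheme, netloc, path, query, fragment))


print(fixurl3('http://\u27a1.ws/\u2665'))
print(fixurl3('http://\u00c5sa:abc123@\u27a1.ws:81/admin'))
```

One behavioural difference to be aware of: in Python 3, `unquote()` decodes to text, so percent-escapes that are not valid UTF-8 (such as `%FF`) are not preserved byte for byte as they are in the Python 2 version.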



	
					
			
			


	
			


	
			
Other tips
			
			
	
		
	
	
The code given by MizardX isn't 100% correct. This example won't work:

example.com/folder/?page=2

Check out django.utils.encoding.iri_to_uri() to convert a Unicode URL to an ASCII URL.

http://docs.djangoproject.com/en/dev/ref/unicode/
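The answer does not say how the failure shows up. One thing that can be observed with a scheme-less input like the one above is that the splitter never recognises a host at all. The sketch below assumes Python 3's `urllib.parse.urlsplit`, and it illustrates that one behaviour only; it may not be the specific problem the answer had in mind.

```python
from urllib.parse import urlsplit

# Without "scheme://", the host is not recognised as a netloc.
parts = urlsplit('example.com/folder/?page=2')
print(repr(parts.netloc))  # ''
print(parts.path)          # example.com/folder/
print(parts.query)         # page=2

# With a default scheme prefixed, the host lands in netloc as expected.
fixed = urlsplit('http://' + 'example.com/folder/?page=2')
print(fixed.netloc)        # example.com
```

A converter that IDNA-encodes only `netloc` would therefore leave a Unicode host untouched (or percent-encode it as part of the path) unless a scheme is added first, which is what the canonurl() function elsewhere on this page does.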
	


	
		
	
	
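There's some RFC 3986 URL parsing work underway (e.g. as part of the Summer of Code), but nothing in the standard library yet as far as I know, and nothing much on the URI encoding side of things either, again as far as I know. So you might as well go with MizardX's elegant approach.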
	


	
		
	
	
OK, here is a canonurl() function that returns the canonical, ASCII form of the URL:

import re
import urllib
import urlparse

def canonurl(url):
    r"""Return the canonical, ASCII-encoded form of a UTF-8 encoded URL, or ''
    if the URL looks invalid.

    >>> canonurl('    ')
    ''
    >>> canonurl('www.google.com')
    'http://www.google.com/'
    >>> canonurl('bad-utf8.com/path\xff/file')
    ''
    >>> canonurl('svn://blah.com/path/file')
    'svn://blah.com/path/file'
    >>> canonurl('1234://badscheme.com')
    ''
    >>> canonurl('bad$scheme://google.com')
    ''
    >>> canonurl('site.badtopleveldomain')
    ''
    >>> canonurl('site.com:badport')
    ''
    >>> canonurl('http://123.24.8.240/blah')
    'http://123.24.8.240/blah'
    >>> canonurl('http://123.24.8.240:1234/blah?q#f')
    'http://123.24.8.240:1234/blah?q#f'
    >>> canonurl('\xe2\x9e\xa1.ws')  # tinyarro.ws
    'http://xn--hgi.ws/'
    >>> canonurl('  http://www.google.com:80/path/file;params?query#fragment  ')
    'http://www.google.com:80/path/file;params?query#fragment'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
    'http://xn--hgi.ws/%E2%99%A5'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth')
    'http://xn--hgi.ws/%E2%99%A5/pa/th'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth;par%2Fams?que%2Fry=a&b=c')
    'http://xn--hgi.ws/%E2%99%A5/pa/th;par/ams?que/ry=a&b=c'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5?\xe2\x99\xa5#\xe2\x99\xa5')
    'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
    >>> canonurl('http://\xe2\x9e\xa1.ws/%e2%99%a5?%E2%99%A5#%E2%99%A5')
    'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
    >>> canonurl('http://badutf8pcokay.com/%FF?%FE#%FF')
    'http://badutf8pcokay.com/%FF?%FE#%FF'
    >>> len(canonurl('google.com/' + 'a' * 16384))
    4096
    """
    # strip spaces at the ends and ensure it's prefixed with 'scheme://'
    url = url.strip()
    if not url:
        return ''
    if not urlparse.urlsplit(url).scheme:
        url = 'http://' + url

    # turn it into Unicode
    try:
        url = unicode(url, 'utf-8')
    except UnicodeDecodeError:
        return ''  # bad UTF-8 chars in URL

    # parse the URL into its components
    parsed = urlparse.urlsplit(url)
    scheme, netloc, path, query, fragment = parsed

    # ensure scheme is a letter followed by letters, digits, and '+-.' chars
    if not re.match(r'[a-z][-+.a-z0-9]*$', scheme, flags=re.I):
        return ''
    scheme = str(scheme)

    # ensure domain and port are valid, eg: sub.domain.<1-to-6-TLD-chars>[:port]
    match = re.match(r'(.+\.[a-z0-9]{1,6})(:\d{1,5})?$', netloc, flags=re.I)
    if not match:
        return ''
    domain, port = match.groups()
    netloc = domain + (port if port else '')
    netloc = netloc.encode('idna')

    # ensure path is valid and convert Unicode chars to %-encoded
    if not path:
        path = '/'  # eg: 'http://google.com' -> 'http://google.com/'
    path = urllib.quote(urllib.unquote(path.encode('utf-8')), safe='/;')

    # ensure query is valid
    query = urllib.quote(urllib.unquote(query.encode('utf-8')), safe='=&?/')

    # ensure fragment is valid
    fragment = urllib.quote(urllib.unquote(fragment.encode('utf-8')))

    # piece it all back together, truncating it to a maximum of 4KB
    url = urlparse.urlunsplit((scheme, netloc, path, query, fragment))
    return url[:4096]

if __name__ == '__main__':
    import doctest
    doctest.testmod()
	


	
		
	
	
You might use urlparse.urlsplit instead, but otherwise you seem to have a very straightforward solution there.


protocol, domain, path, query, fragment = urlparse.urlsplit(url)


(You can access the domain and port separately through the named properties of the returned value, but since port syntax is always ASCII, it is unaffected by the IDNA encoding process.)
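As a small illustration of those named properties, the sketch below reads the host and port separately and IDNA-encodes only the host. It assumes Python 3's `urllib.parse` rather than the Python 2 `urlparse` module named above, and the variable names are illustrative.

```python
from urllib.parse import urlsplit

# "\u27a1" is the arrow character from the question's example domain.
parts = urlsplit('http://\u27a1.ws:81/admin')

# hostname and port are exposed separately from the raw netloc
host = parts.hostname.encode('idna').decode('ascii')
port = parts.port  # an int, or None when no port is given

netloc = host if port is None else '%s:%d' % (host, port)
print(netloc)  # xn--hgi.ws:81
```

Note that `hostname` is returned lower-cased and without any user:password prefix, so user info has to be carried over separately if it must be kept.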
	


			

		

			



	
		
License: CC-BY-SA with attribution
Not affiliated with StackOverflow