How to unquote a urlencoded unicode string in Python?

https://stackoverflow.com/questions/300445

08-07-2019

Question

I have a Unicode string like "Tanım" which is somehow encoded as "Tan%u0131m". How can I convert this encoded string back to the original Unicode? Apparently urllib.unquote does not support Unicode.

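Solution

%uXXXX is a non-standard encoding scheme that has been rejected by the W3C.

The more common technique seems to be to UTF-8 encode the string and then % escape the resulting bytes using %XX. This scheme is supported by urllib.unquote: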

>>> urllib2.unquote("%0a")
'\n'

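Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is likely to be far preferable to simply UTF-8 encode your Unicode string and then % escape the resulting bytes.

A more complete example: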

>>> u"Tanım"
u'Tan\u0131m'
>>> url = urllib.quote(u"Tanım".encode('utf8'))
>>> urllib.unquote(url).decode('utf8')
u'Tan\u0131m'
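For readers on Python 3, the same round trip might look roughly like the sketch below. It assumes the standard urllib.parse module, whose quote() UTF-8 encodes a str by default and whose unquote() decodes as UTF-8 by default; the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

original = "Tanım"

# quote() UTF-8 encodes the text, then %XX-escapes each non-safe byte.
encoded = quote(original)

# unquote() reverses both steps, decoding the bytes as UTF-8 by default.
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

Note that this produces %XX escapes of UTF-8 bytes, not the %uXXXX form from the question.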

Other tips

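import re

def unquote(text):
    # Python 2: unichr() turns a code point into a unicode character.
    def unicode_unquoter(match):
        return unichr(int(match.group(1), 16))
    # Replace every %uXXXX escape (exactly four hex digits).
    return re.sub(r'%u([0-9a-fA-F]{4})', unicode_unquoter, text)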

This will do it if you absolutely have to have this (I really do agree with the cries of "non-standard"):

from urllib import unquote

def unquote_u(source):
    result = unquote(source)
    if '%u' in result:
        result = result.replace('%u','\\u').decode('unicode_escape')
    return result

print unquote_u('Tan%u0131m')

> Tanım

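The above version has a bug where it occasionally freaks out when the string contains both ASCII-encoded and Unicode-encoded characters. I think it happens specifically when there are characters from the upper 128 range, like '\xab', in addition to the Unicode escapes.

For example, "%5B%AB%u03E1%BB%5D" causes this error.

I found that if you decode the Unicode escapes first, the problem goes away: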

def unquote_u(source):
  result = source
  if '%u' in result:
    result = result.replace('%u','\\u').decode('unicode_escape')
  result = unquote(result)
  return result
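A rough Python 3 sketch of the same ordering, for illustration only. The name unquote_u_first and the latin-1 default are assumptions of mine, not part of the answer above: latin-1 is chosen so that a lone byte such as %AB maps to the character '\xab' instead of being replaced as invalid UTF-8.

```python
import re
from urllib.parse import unquote

# Matches %uXXXX with exactly four hex digits.
_U_ESCAPE = re.compile(r'%u([0-9a-fA-F]{4})')

def unquote_u_first(source, encoding='latin-1'):
    """Hypothetical helper: decode %uXXXX escapes first, then %XX escapes."""
    # Step 1: turn each %uXXXX into the corresponding character.
    result = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)
    # Step 2: decode the remaining %XX escapes; the encoding is an assumption.
    return unquote(result, encoding=encoding)

print(unquote_u_first('%5B%AB%u03E1%BB%5D'))
```

This sketch does not merge UTF-16 surrogate pairs, so it only covers escapes inside the Basic Multilingual Plane.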

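You have a URL using a non-standard encoding scheme, rejected by standards bodies but still being produced by some encoders. The Python urllib.parse.unquote() function can't handle these.

Creating your own decoder is not that hard. %uhhhh entries are meant to be UTF-16 code units here, so we need to take surrogate pairs into account. I've also seen %hh codepoints mixed in, for added confusion.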
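To make the surrogate pair point concrete, here is a small sketch of the standard UTF-16 arithmetic that folds a high and a low surrogate into one code point. The helper name combine_surrogates is hypothetical and only for illustration.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # The high surrogate carries the top 10 bits, the low surrogate the
    # bottom 10 bits, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# %ud83c%udfd6 is the pair (0xD83C, 0xDFD6).
code_point = combine_surrogates(0xD83C, 0xDFD6)
print(hex(code_point))  # 0x1f3d6
print(chr(code_point))
```

So an input such as %ud83c%udfd6 should decode to the single character U+1F3D6 rather than two separate surrogates.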

With that in mind, here is a decoder which works in both Python 2 and Python 3, provided you pass in a str object in Python 3 (Python 2 cares less):

try:
    # Python 3
    from urllib.parse import unquote
    unichr = chr
except ImportError:
    # Python 2
    from urllib import unquote

def unquote_unicode(string, _cache={}):
    string = unquote(string)  # handle two-digit %hh components first
    parts = string.split(u'%u')
    if len(parts) == 1:
        return string  # no %u escapes: return the unquoted string, not the list
    r = [parts[0]]
    append = r.append
    for part in parts[1:]:
        try:
            digits = part[:4].lower()
            if len(digits) < 4:
                raise ValueError
            ch = _cache.get(digits)
            if ch is None:
                ch = _cache[digits] = unichr(int(digits, 16))
            if (
                len(r) > 1 and not r[-1] and  # guard r[-2] on the first part
                u'\uDC00' <= ch <= u'\uDFFF' and
                u'\uD800' <= r[-2] <= u'\uDBFF'
            ):
                # UTF-16 surrogate pair, replace with single non-BMP codepoint
                r[-2] = (r[-2] + ch).encode(
                    'utf-16', 'surrogatepass').decode('utf-16')
            else:
                append(ch)
            append(part[4:])
        except ValueError:
            append(u'%u')
            append(part)
    return u''.join(r)

The function is heavily inspired by the current standard-library implementation.

Demo:

>>> print(unquote_unicode('Tan%u0131m'))
Tanım
>>> print(unquote_unicode('%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4'))
איך ממירים את הטקסט הזה
>>> print(unquote_unicode('%ud83c%udfd6'))  # surrogate pair
🏖
>>> print(unquote_unicode('%ufoobar%u666'))  # incomplete
%ufoobar%u666

The function works on Python 2 (tested on 2.4 - 2.7) and Python 3 (tested on 3.3 - 3.8).

License: CC-BY-SA with attribution
