Any gotchas using unicode_literals in Python 2.6?

https://stackoverflow.com/questions/809796

03-07-2019

Question

We already have our code base running under Python 2.6. To prepare for Python 3.0, we have started adding:

from __future__ import unicode_literals

to our .py files (as we modify them). I'm wondering whether anyone else has been doing this and has run into any non-obvious gotchas (perhaps after spending a lot of time debugging).

Solution

The main source of problems I have had working with unicode strings is mixing UTF-8 encoded strings with unicode ones.

For example, consider the following scripts.

two.py

# encoding: utf-8
name = 'helló wörld from two'

one.py

# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name

The output of running python one.py is:

Traceback (most recent call last):
  File "one.py", line 5, in <module>
    print name + two.name
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

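In this example, two.name is a UTF-8 encoded byte string (not unicode), because two.py did not import unicode_literals, while one.name is a unicode string. When you mix the two, Python tries to decode the encoded string (assuming it is ASCII) in order to convert it to unicode, and fails. It would work if you did print name + two.name.decode('utf-8').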
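One way to avoid scattering .decode('utf-8') over every concatenation is to normalize values to text at the point where they enter your code. The sketch below is only an illustration: to_text is a hypothetical helper that is not part of the original answer, and it assumes that any byte string it receives is UTF-8 encoded.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_text(value, encoding='utf-8'):
    """Return value as a text (unicode) string.

    Hypothetical helper: assumes byte strings are encoded with `encoding`.
    """
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2.6+
        return value.decode(encoding)
    return value


# Simulates two.name: a UTF-8 byte string from a module without unicode_literals.
two_name = 'helló wörld from two'.encode('utf-8')
name = 'helló wörld from one'

# Both operands are unicode now, so no implicit ASCII decode takes place.
greeting = name + to_text(two_name)
```

Values that are already text pass through unchanged, so the helper can be applied to anything coming from a module that has not been converted yet.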

The same thing can happen if you encode a string and try to mix it with others later. For example, this works:

# encoding: utf-8
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

DEBUG: <html><body>helló wörld</body></html>

But after adding the unicode_literals import, it does NOT:

# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print 'DEBUG: %s' % html
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

It fails because 'DEBUG: %s' is now a unicode string, so Python tries to decode html as ASCII. Two ways to fix the print are print str('DEBUG: %s') % html or print 'DEBUG: %s' % html.decode('utf-8').
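A more general way to sidestep this class of error is to keep every string as unicode inside the program and encode exactly once, when the bytes leave it. The following is a minimal sketch under that assumption: write_debug is a hypothetical name, and io.BytesIO stands in for a real byte stream such as a file opened in binary mode.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io


def write_debug(stream, html, encoding='utf-8'):
    """Format a debug line as text, then encode once on output (hypothetical helper)."""
    if isinstance(html, bytes):
        # Decode incoming byte strings so the % formatting below stays all-unicode.
        html = html.decode(encoding)
    line = 'DEBUG: %s\n' % html
    stream.write(line.encode(encoding))


sink = io.BytesIO()  # stand-in for a binary file or socket
write_debug(sink, '<html><body>helló wörld</body></html>')
write_debug(sink, '<html><body>helló wörld</body></html>'.encode('utf-8'))
output = sink.getvalue()
```

Both calls produce the same bytes, whether the caller passes unicode or an already encoded string.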

I hope this helps you understand the potential gotchas when using unicode strings.

Other tips

Also, in 2.6 (before Python 2.6.5 RC1) unicode literals do not play well with keyword arguments (issue 4978):

For example, the following code works without unicode_literals, but fails with TypeError: keywords must be strings when unicode_literals is used.

  >>> from __future__ import unicode_literals
  >>> def foo(a=None): pass
  ...
  >>> foo(**{'a':1})
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: foo() keywords must be strings
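On interpreters affected by that issue, one possible workaround is to convert the dictionary keys to the native str type before unpacking them. This is a sketch only: native_kwargs is a hypothetical helper, and it assumes every key is pure ASCII, which is true of any valid Python 2 identifier.

```python
from __future__ import unicode_literals


def native_kwargs(mapping):
    """Return a copy of mapping whose keys are native `str` objects.

    Hypothetical helper; on Python 2, str(key) raises UnicodeEncodeError
    for non-ASCII keys, so ASCII-only keys are assumed.
    """
    return dict((str(key), value) for key, value in mapping.items())


def foo(a=None):
    return a


# The keys are converted before ** unpacking, so foo() receives native strings.
result = foo(**native_kwargs({'a': 1}))
```

On Python 3 and on fixed 2.x releases the conversion is harmless, since str(key) returns an equivalent native string.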

I found that if you add the unicode_literals directive, you should also add something like:

 # -*- coding: utf-8

to the first or second line of your .py file. Otherwise, lines such as:

 foo = "barré"

result in an error such as:

SyntaxError: Non-ASCII character '\xc3' in file mumble.py on line 198,
 but no encoding declared; see http://www.python.org/peps/pep-0263.html 
 for details
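Before adding unicode_literals to many files, it can help to find the ones that would trip over this error. The check below is a simplified sketch, not a full implementation of PEP 263: needs_coding_declaration is a hypothetical name, the regular expression is adapted from the PEP, and a UTF-8 byte order mark (which Python also accepts as a declaration) is deliberately ignored.

```python
import re

# Pattern adapted from PEP 263; the declaration must be on line 1 or 2.
CODING_RE = re.compile(br'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def needs_coding_declaration(source):
    """Return True if `source` (bytes) has non-ASCII bytes but no declaration."""
    lines = source.splitlines()
    declared = any(CODING_RE.match(line) for line in lines[:2])
    has_non_ascii = any(byte > 127 for byte in bytearray(source))
    return has_non_ascii and not declared


# Hypothetical file contents, given as raw bytes the way they sit on disk.
declared_source = b"# -*- coding: utf-8\nfoo = 'barr\xc3\xa9'\n"
undeclared_source = b"foo = 'barr\xc3\xa9'\n"

flagged = [needs_coding_declaration(src)
           for src in (declared_source, undeclared_source)]
```

In practice you would read each file in binary mode and pass its contents to the function.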

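Also take into account that unicode_literals will affect eval() but not repr() (an asymmetric behavior which IMHO is a bug), i.e. eval(repr(b'\xa4')) won't be equal to b'\xa4' (as it would be in Python 3).

Ideally, the following code would be an invariant that always holds, for all combinations of unicode_literals and Python {2.7, 3.x} usage: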

from __future__ import unicode_literals

bstr = b'\xa4'
assert eval(repr(bstr)) == bstr # fails in Python 2.7, holds in 3.1+

ustr = '\xa4'
assert eval(repr(ustr)) == ustr # holds in Python 2.7 and 3.1+

The second assertion happens to work, because repr('\xa4') evaluates to u'\xa4' in Python 2.7.
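If you need the bytes round trip to hold under unicode_literals, one possible workaround is to make sure the repr always carries an explicit b prefix before it is evaluated. This is a hedged sketch: portable_bytes_repr is a hypothetical helper intended only for bytes values (str on Python 2), not for bytearray or other types.

```python
from __future__ import unicode_literals


def portable_bytes_repr(value):
    """Return a repr for a bytes object that always carries the b prefix.

    Hypothetical helper; only meant for bytes (str on Python 2) values.
    """
    text = repr(value)
    # Python 2 omits the prefix in repr(); Python 3 already includes it.
    return text if text.startswith('b') else 'b' + text


bstr = b'\xa4'
# With the prefix present, eval() yields bytes even under unicode_literals.
round_tripped = eval(portable_bytes_repr(bstr))
```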

There are more.

There are libraries and builtins that expect byte strings and do not tolerate unicode.

Two examples:

builtin:

myenum = type('Enum', (), enum)

(slightly esoteric) doesn't work with unicode_literals: type() expects a string.

library:

from wx.lib.pubsub import pub
pub.sendMessage("LOG MESSAGE", msg="no go for unicode literals")

doesn't work: the wx pubsub library expects a string message type.

The former is esoteric and easily fixed with

myenum = type(b'Enum', (), enum)

but the latter is devastating if your code is full of calls to pub.sendMessage().

Dang it, eh?!?
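Note that type(b'Enum', (), enum) raises a TypeError on Python 3, where the class name must be a str. For code that has to run on both major versions, a small wrapper that returns whatever the interpreter treats as its native string type may be easier to maintain. The sketch below is an assumption, not something from the answers above: native_str is a hypothetical helper, and the enum dictionary is a stand-in for the one in the example.

```python
from __future__ import unicode_literals

import sys


def native_str(value, encoding='ascii'):
    """Return value as the interpreter's native `str` type (hypothetical helper)."""
    if sys.version_info[0] == 2 and not isinstance(value, str):
        # Python 2: unicode literal -> byte string, as type() expects there.
        return value.encode(encoding)
    return value


enum = {'RED': 1, 'GREEN': 2}  # stand-in for the `enum` dict in the example
myenum = type(native_str('Enum'), (), enum)
```

The same wrapper could in principle be applied to the topic name passed to pub.sendMessage(), although that is untested here and still means touching every call site.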

Click will raise unicode exceptions all over the place if any module that has from __future__ import unicode_literals is imported where you use click.echo. It's a nightmare…

License: CC-BY-SA with attribution

Not affiliated with StackOverflow