Any gotchas using unicode_literals in Python 2.6?

https://stackoverflow.com/questions/809796

03-07-2019

Question

We've already gotten our code base running under Python 2.6. To prepare for Python 3.0, we've started adding:

from __future__ import unicode_literals

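to our .py files (as we modify them). I'm wondering whether anyone else has been doing this and has run into any non-obvious gotchas (perhaps after spending a lot of time debugging).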

Solution

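The main source of problems I've had working with unicode strings is mixing UTF-8 encoded byte strings with unicode ones.

For example, consider the following scripts.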

two.py

# encoding: utf-8
name = 'helló wörld from two'

one.py

# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name

The output of running python one.py is:

Traceback (most recent call last):
  File "one.py", line 5, in <module>
    print name + two.name
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

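In this example, two.name is a UTF-8 encoded byte string (not unicode) because two.py did not import unicode_literals, while one.name is a unicode string. When you mix the two, Python tries to decode the byte string (assuming it is ASCII) in order to convert it to unicode, and fails. It would work if you did print name + two.name.decode('utf-8').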
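One way to avoid the implicit ASCII decode is to convert byte strings to unicode explicitly wherever values from modules without unicode_literals enter your code. The following is only a minimal sketch: to_text is a hypothetical helper name, not something from the standard library, and the u'' prefixes stand in for what unicode_literals does to plain literals.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# A unicode string, like name in one.py.
name_one = u'hell\xf3 w\xf6rld from one'
# A UTF-8 encoded byte string, like name in two.py.
name_two = u'hell\xf3 w\xf6rld from two'.encode('utf-8')

# Both operands are unicode by the time they are concatenated,
# so no implicit ASCII decoding takes place.
combined = to_text(name_one) + to_text(name_two)
```

The encoding is an assumption here. It has to match whatever the other module's source file actually declares.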

The same thing can happen if you encode a string and try to mix it with unicode later. For example, this works:

# encoding: utf-8
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

DEBUG: <html><body>helló wörld</body></html>

But after adding the unicode_literals import it does NOT:

# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print 'DEBUG: %s' % html
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

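It fails because 'DEBUG: %s' is now a unicode string, so Python tries to decode html (which has just been encoded to a UTF-8 byte string) as ASCII. Two ways to fix the print are print str('DEBUG: %s') % html or print 'DEBUG: %s' % html.decode('utf-8').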
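A pattern that sidesteps this class of error is to do all formatting in unicode and encode exactly once, at the point where bytes leave the program. This is a rough sketch under that assumption. write_debug is a hypothetical name, and io.BytesIO stands in for a real byte stream such as a file opened in binary mode.

```python
import io

def write_debug(stream, html, encoding='utf-8'):
    """Format the message as unicode, then encode once when writing."""
    if isinstance(html, bytes):
        # Decode explicitly instead of relying on the implicit ASCII codec.
        html = html.decode(encoding)
    line = u'DEBUG: %s\n' % html
    stream.write(line.encode(encoding))

buf = io.BytesIO()
# Pass an already-encoded byte string, as in the failing example.
write_debug(buf, u'<html><body>hell\xf3 w\xf6rld</body></html>'.encode('utf-8'))
```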



I hope this helps you understand the potential gotchas when using unicode strings.


Other tips

Also, in 2.6 (before Python 2.6.5 RC1+), unicode literals don't play nicely with keyword arguments (issue4978):

For example, the following code works without unicode_literals, but fails with TypeError: keywords must be strings if unicode_literals is used.

  >>> from __future__ import unicode_literals
  >>> def foo(a=None): pass
  ...
  >>> foo(**{'a':1})
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: foo() keywords must be strings
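If you are stuck on an affected 2.6 release, one possible workaround is to coerce the dictionary keys to the native str type before unpacking them as keyword arguments. The sketch below assumes the keys are plain ASCII identifiers. native_str_keys is a hypothetical helper name.

```python
def native_str_keys(mapping):
    """Return a copy of mapping whose keys have been passed through str()."""
    return dict((str(key), value) for key, value in mapping.items())

def foo(a=None):
    return a

# The u'' prefix mimics what unicode_literals does to a plain 'a' literal.
result = foo(**native_str_keys({u'a': 1}))
```

On Python 2, str() turns an ASCII unicode key into a byte string. On Python 3 it leaves the key as it is.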
	
I did find that if you add the unicode_literals directive, you should also add something like:

 # -*- coding: utf-8


to the first or second line of your .py file. Otherwise, lines such as:

 foo = "barré"


result in an error such as:

SyntaxError: Non-ASCII character '\xc3' in file mumble.py on line 198,
 but no encoding declared; see http://www.python.org/peps/pep-0263.html 
 for details
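If you are adding unicode_literals to many files, it may help to check which of them already declare an encoding. This is a simplified sketch using the regular expression given in PEP 263. declared_encoding is a hypothetical helper, and it ignores finer points such as byte order marks.

```python
import re

# Pattern for the encoding declaration comment, as given in PEP 263.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

def declared_encoding(source_text):
    """Return the encoding named on the first or second line, or None."""
    for line in source_text.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

with_header = declared_encoding(' # -*- coding: utf-8\nfoo = 1\n')
without_header = declared_encoding('import os\nfoo = 1\n')
```

Here with_header is 'utf-8' and without_header is None, so files of the second kind would need the declaration before any non-ASCII literal is added.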
	
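Also take into account that unicode_literals affects eval() but not repr() (an asymmetric behavior which IMHO is a bug), i.e. eval(repr(b'\xa4')) won't be equal to b'\xa4' (as it would be in Python 3).

Ideally, the following code would be an invariant that always holds, for all combinations of unicode_literals and Python {2.7, 3.x} usage: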

from __future__ import unicode_literals

bstr = b'\xa4'
assert eval(repr(bstr)) == bstr # fails in Python 2.7, holds in 3.1+

ustr = '\xa4'
assert eval(repr(ustr)) == ustr # holds in Python 2.7 and 3.1+


The second assertion happens to work because repr('\xa4') evaluates to u'\xa4' in Python 2.7.
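A possible workaround for the first assertion is to compile the repr() output with dont_inherit set, so that the expression is evaluated without the calling module's __future__ flags. Treat this as an untested-on-2.7 sketch rather than a confirmed fix. eval_without_future is a hypothetical name.

```python
def eval_without_future(source):
    """Evaluate an expression without inheriting the caller's __future__ flags."""
    # compile(source, filename, mode, flags, dont_inherit)
    code = compile(source, '<repr>', 'eval', 0, True)
    return eval(code)

bstr = b'\xa4'
roundtrip = eval_without_future(repr(bstr))
```

The intent is that roundtrip equals bstr regardless of whether the surrounding module uses unicode_literals.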
	
There are more.

Some libraries and builtins expect str (byte) strings and do not tolerate unicode.

Two examples:

Builtin:

myenum = type('Enum', (), enum)


(slightly esoteric) doesn't work with unicode_literals: type() expects a str for the class name.

Library:

from wx.lib.pubsub import pub
pub.sendMessage("LOG MESSAGE", msg="no go for unicode literals")


doesn't work: the wx pubsub library expects a str message type.

The former is esoteric and easily fixed with

myenum = type(b'Enum', (), enum)


but the latter is devastating if your code is full of calls to pub.sendMessage() (which mine is).

Dang it, eh?!?
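For the type() case, note that a b'Enum' name is rejected on Python 3, where type() wants a text str. Wrapping the name in str() gives the native string type on both major versions, so a portable variant might look like the sketch below. make_enum and the member names are made up for illustration.

```python
def make_enum(name, members):
    """Build a simple class from a dict, coercing the name to native str."""
    return type(str(name), (), dict(members))

# The u'' prefix mimics a literal written under unicode_literals.
MyEnum = make_enum(u'Enum', {'RED': 1, 'GREEN': 2})
```

The same str(...) wrapping may help for library calls that insist on a native string, but that has to be checked against each library.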

Click will raise unicode exceptions all over the place if any module that has from __future__ import unicode_literals is imported where you use click.echo. It's a nightmare...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow