Make Python replace un-encodable characters with a string by default

https://stackoverflow.com/questions/1933184

20-09-2019

Question

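I want to make Python ignore characters it cannot encode by simply replacing them with the string "<could not encode>".

E.g., assuming the default encoding is ASCII, the command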

'%s is the word'%'ébác'

would yield

'<could not encode>b<could not encode>c is the word'

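Is there any way to make this the default behavior across all my projects?

Solution

The str.encode function takes an optional argument that defines the error handling: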

str.encode([encoding[, errors]])

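From the docs:

Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.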
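As a rough illustration of the built-in schemes named in that passage, the sketch below encodes the question's sample string to ASCII with each one. It assumes Python 3, where str is Unicode text and encode() returns bytes; the thread itself appears to use Python 2, where the output looks slightly different.

```python
# Assumes Python 3. The escapes spell "ébác is the word".
text = "\u00e9b\u00e1c is the word"

# Try each built-in error handling scheme in turn.
for scheme in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    print(scheme, text.encode("ascii", scheme))

# With no errors argument the scheme is "strict", which raises.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict raised:", exc.reason)
```

None of the built-in schemes inserts an arbitrary string such as "<could not encode>", which is why a custom handler is needed for the behavior asked about.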

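In your case, the codecs.register_error function might be of interest.

[Note about bad characters]

By the way, note that when using register_error you will likely find yourself replacing not just individual bad characters but whole groups of consecutive bad characters with your string, unless you take care. The error handler is called once per run of bad characters, not once per character.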
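A minimal sketch of such a handler, assuming Python 3. The handler name "could_not_encode" and the helper replace_each are made up for this example. To get one replacement per character rather than one per run, the handler repeats the replacement string once for each character in the failing slice:

```python
import codecs

REPLACEMENT = "<could not encode>"

def replace_each(exc):
    """Substitute REPLACEMENT once per unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object[exc.start:exc.end] is the run of characters that failed.
    bad_run = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume encoding.
    return (REPLACEMENT * len(bad_run), exc.end)

# "could_not_encode" is a hypothetical name chosen for this sketch.
codecs.register_error("could_not_encode", replace_each)

print("\u00e9b\u00e1c is the word".encode("ascii", "could_not_encode"))
```

Registering the handler does not change the default scheme: each encode() call still has to pass the handler name explicitly.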

Other tips

>>> help("".encode)
Help on built-in function encode:

encode(...)
S.encode([encoding[,errors]]) -> object

Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.

So, for instance:

>>> x
'\xc3\xa9b\xc3\xa1c is the word'
>>> x.decode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> x.decode("ascii", "replace")
u'\ufffd\ufffdb\ufffd\ufffdc is the word'

Register your own callback with codecs.register_error to substitute the string of your choice.
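For the decoding direction shown in the session above, a callback might look like the following sketch. It assumes Python 3, where the input is a bytes object, and the handler name "could_not_decode" and the marker string are invented for illustration.

```python
import codecs

def mark_undecodable(exc):
    """Substitute a marker once per undecodable byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # Repeat the marker for every byte in the failing slice, then resume after it.
    return ("<could not decode>" * (exc.end - exc.start), exc.end)

# "could_not_decode" is a hypothetical name chosen for this sketch.
codecs.register_error("could_not_decode", mark_undecodable)

raw = b"\xc3\xa9b\xc3\xa1c is the word"  # UTF-8 bytes, decoded as ASCII on purpose
decoded = raw.decode("ascii", "could_not_decode")
print(decoded)
```

Because each accented letter occupies two bytes in UTF-8, decoding as ASCII yields two markers per letter here.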

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow