Make Python replace unencodable characters with a string by default

https://stackoverflow.com/questions/1933184

20-09-2019

Question

I want Python to handle characters it cannot encode by simply replacing each one with the string "<could not encode>".

For example, assuming the default encoding is ASCII, the command

'%s is the word'%'ébác'

should return

'<could not encode>b<could not encode>c is the word'

Is there a way to make this the default behavior across my whole project?

Solution

The str.encode function takes an optional argument that defines the error handling:

str.encode([encoding[, errors]])

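From the docs:

Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.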
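To make the built-in schemes listed in the quote concrete, here is a small sketch. It assumes Python 3, where str.encode returns bytes; the original answer was written against Python 2. The variable names are chosen only for this illustration.

```python
# Sketch (assumes Python 3): compare the built-in error handlers for str.encode.
text = "\u00e9b\u00e1c is the word"  # 'ébác is the word'

ignored = text.encode("ascii", "ignore")              # drops unencodable characters
replaced = text.encode("ascii", "replace")            # substitutes '?'
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # XML character references
escaped = text.encode("ascii", "backslashreplace")    # backslash escape sequences

print(ignored)   # b'bc is the word'
print(replaced)  # b'?b?c is the word'
print(xml_refs)  # b'&#233;b&#225;c is the word'
print(escaped)   # b'\\xe9b\\xe1c is the word'
```

None of the built-in schemes inserts a custom string, which is why a handler registered through codecs.register_error is needed for the behavior asked about in the question.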

In your case, the codecs.register_error function might be of interest.

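[Note about bad characters]

By the way, when using register_error, be aware that unless you pay attention you will probably find yourself replacing not individual bad characters but whole groups of consecutive bad characters with your string. The error handler is called once per run of bad characters, not once per character.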
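One possible way to write such a handler is sketched below, assuming Python 3. The handler name could_not_encode and the function replace_each_char are hypothetical names chosen for this example. The handler multiplies the placeholder by the length of the reported span, so the output contains one placeholder per bad character however the codec groups them.

```python
import codecs

PLACEHOLDER = "<could not encode>"

def replace_each_char(error):
    """Hypothetical handler: emit one placeholder per unencodable character."""
    if not isinstance(error, UnicodeEncodeError):
        raise error  # this sketch only handles encoding errors
    # error.start and error.end delimit the unencodable span in error.object.
    bad_span = error.object[error.start:error.end]
    # Return the replacement text and the position at which encoding resumes.
    return (PLACEHOLDER * len(bad_span), error.end)

# "could_not_encode" is an illustrative name, not a built-in scheme.
codecs.register_error("could_not_encode", replace_each_char)

encoded = "\u00e9b\u00e1c is the word".encode("ascii", "could_not_encode")
print(encoded.decode("ascii"))
# <could not encode>b<could not encode>c is the word
```

The handler still has to be named explicitly in every encode call. This sketch does not change what Python does when no errors argument is given.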

Other tips

>>> help("".encode)
Help on built-in function encode:

encode(...)
S.encode([encoding[,errors]]) -> object

Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. **Other possible values are** 'ignore', **'replace'** and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.

So, for example:

>>> x
'\xc3\xa9b\xc3\xa1c is the word'
>>> x.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> x.decode("ascii", "replace")
u'\ufffd\ufffdb\ufffd\ufffdc is the word'

Add your own callback with codecs.register_error to replace with the string of your choice.
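For the decoding direction shown in the session above, a callback might look like the following sketch. It assumes Python 3, where the input is a bytes object; the handler name mark_undecodable and the placeholder text are hypothetical.

```python
import codecs

def mark_undecodable(error):
    """Hypothetical handler: substitute a marker for each undecodable span."""
    if not isinstance(error, UnicodeDecodeError):
        raise error  # this sketch only handles decoding errors
    # Return the replacement text and the position at which decoding resumes.
    return ("<could not decode>", error.end)

# "mark_undecodable" is an illustrative name, not a built-in scheme.
codecs.register_error("mark_undecodable", mark_undecodable)

raw = b"\xc3\xa9b\xc3\xa1c is the word"  # UTF-8 bytes for 'ébác is the word'
decoded = raw.decode("ascii", "mark_undecodable")
print(decoded)
```

The ASCII decoder reports each offending byte separately, so every two-byte UTF-8 character here produces two markers, just as the 'replace' scheme produced two u'\ufffd' characters per letter in the session above.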

License: CC-BY-SA with attribution

Not affiliated with StackOverflow