Specifying unicode literal's encoding on a per-literal basis

https://stackoverflow.com/questions/19420397

01-07-2022

Question

According to the documentation, it is possible to define the encoding of the literals used in a Python source file like this:

# -*- coding: latin-1 -*-

u = u'abcdé'  # This is a unicode string encoded in latin-1

Is there any syntax support for specifying the encoding on a per-literal basis? I am looking for something like:

latin1 = u('latin-1')'abcdé'  # This is a unicode string encoded in latin-1
utf8   = u('utf-8')'xxxxx'    # This is a unicode string encoded in utf-8

I know that syntax does not make sense, but I am looking for something similar. What can I do? Or is it maybe not possible to have a single source file with unicode strings in different encodings?

Solution

No, there is no way to mark a unicode literal as using a different encoding from the rest of the source file.

Instead, you'd manually decode the literal from a bytestring (Python 2 syntax):

latin1 = 'abcdé'.decode('latin1')  # provided `é` is stored in the source as a single 0xE9 byte

or using escape sequences:

latin1 = 'abcd\xe9'.decode('latin1')
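
The snippets above rely on Python 2, where a plain string literal is a bytestring. In Python 3, `str` has no `.decode()` method, so the same approach would need `bytes` literals. A minimal sketch, assuming Python 3 (the variable names are only illustrative):

```python
# Assumption: Python 3, where only bytes objects have .decode().
# Each literal states its own encoding through escaped bytes, so the
# source file itself can stay pure ASCII.
latin1 = b'abcd\xe9'.decode('latin-1')    # 0xE9 is é in Latin-1
utf8 = b'abcd\xc3\xa9'.decode('utf-8')    # 0xC3 0xA9 is é in UTF-8

# Both literals decode to the same five-character Unicode string.
print(latin1 == utf8)   # True
print(ascii(latin1))    # 'abcd\xe9'
```

Writing the bytes as escapes keeps the result independent of whatever encoding the file happens to be saved in.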

The whole point of the source-code codec line is to support using an arbitrary codec in your editor. Source code should never use mixed encodings, really.
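
To illustrate why mixing encodings causes trouble, here is a small sketch (assuming Python 3; the names are hypothetical) of what happens when bytes written in one encoding are read with another:

```python
# Assumption: Python 3. The byte 0xE9 is é in Latin-1, but in UTF-8 it
# only starts a multi-byte sequence, so on its own it is invalid.
raw = b'abcd\xe9'

try:
    raw.decode('utf-8')
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(decoded_as_utf8)               # False
print(ascii(raw.decode('latin-1')))  # 'abcd\xe9'
```

A file containing both Latin-1 and UTF-8 byte sequences would hit the same kind of error, or silently produce wrong characters, under any single codec declaration.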

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow