Question

I would like to show my students what happens when a file encoded as latin1 is opened as macroman, and vice versa:

>>> s = u"Tout condamné à mort aura la tête tranchée."
>>> print s.encode("latin1").decode("macroman")
Tout condamnÈ ‡ mort aura la tÍte tranchÈe.
>>> print s.encode("macroman").decode("latin1")
Tout condamn  mort aura la tte tranche.

But I'm puzzled by the fact that the second conversion doesn't show any visible non-ASCII characters. Aren't macroman and latin1 both meant to be byte <-> character bijections?

NB: This is not Python-related, since I can reproduce the behaviour with a text editor.

Solution

“Latin1” is a vague term and may refer to ISO Latin 1 (ISO 8859-1) or to Windows Latin 1 (windows-1252). The difference is that in ISO Latin 1, bytes 0x80 to 0x9F are designated as control characters (rarely used), whereas in Windows Latin 1, most of them are defined as graphic characters (punctuation and some non-ASCII Latin letters) and a few are left undefined.
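
The difference can be checked byte by byte. The following is a minimal sketch, assuming Python 3 and its built-in codecs, where "latin-1" means ISO 8859-1 and "cp1252" means windows-1252; the variable names are only illustrative:

```python
import unicodedata

c1_range = range(0x80, 0xA0)  # bytes 0x80 to 0x9F

# ISO 8859-1: collect the Unicode category of every byte in the range.
# "Cc" is the category of control characters.
iso_categories = {
    unicodedata.category(bytes([b]).decode("latin-1")) for b in c1_range
}

# windows-1252: most bytes decode to graphic characters; Python's codec
# raises UnicodeDecodeError for the few that are undefined.
cp1252_defined = {}
cp1252_undefined = []
for b in c1_range:
    try:
        cp1252_defined[b] = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        cp1252_undefined.append(b)

print(iso_categories)
print(len(cp1252_defined), [hex(b) for b in cp1252_undefined])
```

With Python's codecs, the first line printed contains only the "Cc" category, and the second shows how many of the 32 bytes windows-1252 defines and which ones it leaves undefined.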

When you take, for example, the letter “é” and encode it as Latin1 (in either of the two Latin1 encodings), you get the byte 0xE9. If you then interpret this byte as MacRoman encoded, as you seem to be doing, you get the “È” character. That’s why you get “condamnÈ”.
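
This first direction can be reproduced in a few lines. A small sketch, again assuming Python 3 and its codec names "latin-1", "cp1252" and "mac_roman":

```python
# Both Latin1 variants encode "é" as the same single byte.
iso_bytes = "é".encode("latin-1")
win_bytes = "é".encode("cp1252")
print(iso_bytes, win_bytes)

# Reading that byte back as MacRoman yields a different letter.
misread = iso_bytes.decode("mac_roman")
print(misread)
```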

But if you encode the letter “é” as MacRoman, you get the byte 0x8E. When this byte is interpreted as Latin1 data, the two Latin1 encodings differ. In ISO Latin 1, it is the control character SINGLE SHIFT TWO (U+008E); in Windows Latin 1, it is “Ž” LATIN CAPITAL LETTER Z WITH CARON (U+017D). Evidently, your code treats Latin1 as ISO Latin 1. Since most programs assign no meaning to U+008E, it is typically ignored in rendering, though in this case it is apparently displayed as a space.
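
The reverse direction, where the two Latin1 variants diverge, might be illustrated like this (a sketch assuming Python 3; note that `unicodedata.name` has no name for control characters, so only the category is queried for the ISO result):

```python
import unicodedata

raw = "é".encode("mac_roman")    # the MacRoman byte for "é"

as_iso = raw.decode("latin-1")   # ISO 8859-1 interpretation
as_win = raw.decode("cp1252")    # windows-1252 interpretation

# The ISO result is an invisible control character, so print its
# code point and category rather than the character itself.
print(raw, "U+%04X" % ord(as_iso), unicodedata.category(as_iso))
print(as_win, "U+%04X" % ord(as_win), unicodedata.name(as_win))
```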

The other cases are similar: MacRoman “à” is 0x88 and MacRoman “ê” is 0x90, both of which fall into the control character range (0x80 to 0x9F) in ISO 8859-1.
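
For a classroom demonstration, the invisible characters can be made visible by replacing each control character with its code point. This is only a sketch, assuming Python 3; `show_controls` is a hypothetical helper written for this illustration, not a library function:

```python
import unicodedata

def show_controls(text):
    """Replace each control character (category Cc) with a <U+XXXX> tag."""
    return "".join(
        "<U+%04X>" % ord(ch) if unicodedata.category(ch) == "Cc" else ch
        for ch in text
    )

s = "Tout condamné à mort aura la tête tranchée."

# Encode as MacRoman, then misread the bytes as ISO 8859-1.
garbled = s.encode("mac_roman").decode("latin-1")
visible = show_controls(garbled)
print(visible)
```

The printed sentence then shows a tag such as <U+008E> wherever the original output appeared to have nothing at all.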

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow