Question

With Python 2.7 I am reading text as unicode and writing it as utf-16-le. Most characters are written correctly, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write it correctly:

import codecs
with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)     
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))

However, either of the following changes makes the code work:

  1. Using unichr(33033) or unichr(33035) instead of unichr(33034).

  2. Using the 'utf-8' encoding (without a BOM, byte-order mark).

How can I recognize characters that won't write correctly, and how can I write a 'utf-16-le' encoded file with BOM that either prints these characters or some replacement?


Solution

You are opening the file in text mode, which means that line-break bytes are translated to the local convention (on Windows, 0A is written as 0D 0A). Unfortunately, the utf-16-le encoding of the character you are trying to write is 0A 81, and its first byte, 0A, is interpreted as a line break, so it does not reach the file unchanged.
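To answer the "how can I recognize these characters" part, here is a minimal sketch that flags any character whose utf-16-le encoding contains the byte 0A. The helper names has_newline_byte and risky_chars are made up for this illustration, and the u'...' literals are used so the snippet runs on Python 2.7 and Python 3 alike.

```python
def has_newline_byte(ch):
    # True if the UTF-16-LE encoding of ch contains the byte 0x0A,
    # which text mode on Windows would rewrite as 0x0D 0x0A.
    return b'\x0a' in ch.encode('utf-16-le')


def risky_chars(text):
    # Hypothetical helper: collect the characters of text that would be
    # corrupted by newline translation when written in text mode.
    return [ch for ch in text if has_newline_byte(ch)]


flags = [has_newline_byte(c) for c in (u'\u8109', u'\u810a', u'\u810b')]
found = risky_chars(u'a\u810ab\u0a05')
```

Here u'\u810a' is flagged because it encodes to 0A 81, while its neighbours encode to 09 81 and 0B 81. Code points of the form U+0Axx, such as u'\u0a05', are affected as well because 0A is then the second byte.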

Open the file in binary mode instead:

open('temp.txt','wb')
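Applied to the code from the question, the fix might look like the sketch below. The function name write_utf16le_with_bom and the temporary directory are assumptions made for this example so that it can be run without touching an existing temp.txt.

```python
import codecs
import os
import tempfile


def write_utf16le_with_bom(path, text):
    # Binary mode: the bytes are written exactly as given,
    # with no newline translation.
    with open(path, 'wb') as f:
        f.write(codecs.BOM_UTF16_LE)
        f.write(text.encode('utf-16-le'))


path = os.path.join(tempfile.mkdtemp(), 'temp.txt')
write_utf16le_with_bom(path, u'\u810a')

# Read the raw bytes back to check what actually reached the file.
with open(path, 'rb') as f:
    data = f.read()
```

The file should then hold the BOM (FF FE) followed by 0A 81, with no extra 0D inserted.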

Other suggestions

@Joni's answer identifies the root of the problem. Alternatively, if you use codecs.open, the file is always opened in binary mode, even if that is not specified. The utf16 codec also writes the BOM automatically, using the machine's native byte order:

import codecs
with codecs.open('temp.txt','w','utf16') as temp:
    temp.write(u'\u810a')

Hex dump of temp.txt (on a little-endian machine):

FF FE 0A 81

Reference: codecs.open
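If no hex editor is at hand, the same check can be done from Python. This is only a sketch: the temporary path is an assumption for the example, and the BOM you see depends on the byte order of the machine running it.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# The 'utf16' codec writes a BOM first, in native byte order.
with codecs.open(path, 'w', 'utf16') as temp:
    temp.write(u'\u810a')

# Inspect the raw bytes: two BOM bytes, then the two bytes of the character.
with open(path, 'rb') as f:
    raw = f.read()
bom = raw[:2]

# Reading with the same codec consumes the BOM and returns the text.
with codecs.open(path, 'r', 'utf16') as temp:
    text = temp.read()
```

On a little-endian machine bom is FF FE, matching the hex dump above. On a big-endian machine it would be FE FF and the character bytes would be swapped.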

You're already using the codecs library. When working with that file, replace open() with codecs.open() so that the encoding is handled transparently.

import codecs
with codecs.open('temp.txt', 'w', encoding='utf-16-le') as temp:
    temp.write(unichr(33033))
    temp.write(unichr(33034))
    temp.write(unichr(33035))

If you still have a problem after that, the issue may lie with your viewer rather than your Python script.
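One caveat: unlike 'utf-16', the explicit 'utf-16-le' codec does not write a BOM by itself. If the file needs one, as the question asks, a possible approach is to write u'\ufeff' as the first character. The sketch below assumes a temporary path and uses u'...' literals so it runs on Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

with codecs.open(path, 'w', encoding='utf-16-le') as temp:
    # U+FEFF encoded as utf-16-le yields the bytes FF FE, i.e. the BOM.
    temp.write(u'\ufeff')
    temp.write(u'\u8109\u810a\u810b')

with open(path, 'rb') as f:
    raw = f.read()

# The generic 'utf-16' decoder uses the BOM to pick the byte order
# and strips it from the result.
decoded = raw.decode('utf-16')
```

This keeps the byte order fixed as little-endian regardless of the machine, which the plain 'utf-16' codec does not guarantee.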

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow