Question

In Python 2.7 running in Ubuntu this code:

f = open("testfile.txt", "w")
f.write("Line one".encode("utf-16"))
f.write(u"\r\n".encode("utf-16"))
f.write("Line two".encode("utf-16"))

produces the desired newline between the two lines of text when read in Gedit:

Line one
Line two

However, the same code executed in Windows 7 and read in Notepad produces unintelligible characters after "Line one" but no newline is recognized by Notepad. How can I write correct newline characters for UTF-16 in Windows to match the output I get in Ubuntu?

I am writing output for a Windows-only application that reads only UTF-16. I've spent hours trying out different tips, but nothing seems to work for Notepad. It's worth mentioning that I can successfully convert a text file to UTF-16 in Notepad itself, but I'd rather have the script save the encoding correctly in the first place.


Solution

The problem is that you're opening the file in text mode, but trying to use it as a binary file.

This:

u"\r\n".encode("utf-16")

… encodes to '\r\0\n\0' (preceded by a BOM, which we'll ignore for the moment).

Then this:

f.write('\r\0\n\0')

… on Windows, because the file is in text mode, converts the Unix newline byte \n to a Windows newline \r\n, giving '\r\0\r\n\0'.

And that, of course, breaks your UTF-16 encoding. Besides the fact that the two \r\n bytes will decode into the valid but unassigned code point U+0A0D, that's an odd number of bytes, meaning you've got a leftover \0. So, instead of L\0 being the next character, it's \0L, aka U+4C00, and so on.
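The corruption can be reproduced on any platform with a small sketch. The helper simulate_windows_text_mode is hypothetical: it only imitates the LF to CR LF translation that Windows text mode applies on write. The BOM-less 'utf-16-le' codec is used here so the byte layout matches the description above.

```python
def simulate_windows_text_mode(data):
    """Hypothetical helper: imitate Windows text mode, where every LF byte
    written becomes CR LF."""
    return data.replace(b"\n", b"\r\n")

# Encode a newline followed by "L" without a BOM.
encoded = u"\r\nL".encode("utf-16-le")
# The extra CR byte shifts everything after it by one byte.
mangled = simulate_windows_text_mode(encoded)

# Decode the first three 16-bit units of the mangled data.
decoded = mangled[:6].decode("utf-16-le")
code_points = [hex(ord(ch)) for ch in decoded]
```

Here mangled has an odd length (7 bytes), and code_points comes out as 0xd, 0xa0d, 0x4c00: the carriage return, then U+0A0D, then the shifted "L".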

On top of that, every .encode("utf-16") call prepends a BOM, so you're writing a new UTF-16 BOM for each encoded string. Most Windows apps will actually transparently handle that and ignore them, so all you're practically doing is wasting two bytes per write, but it isn't actually correct.
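A quick way to see the repeated BOM, as a minimal sketch (the variable names are only illustrative):

```python
import codecs

# Either byte order is possible, depending on the machine.
boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

first = u"Line one".encode("utf-16")
second = u"Line two".encode("utf-16")

# Both encoded strings start with their own BOM.
first_has_bom = first.startswith(boms)
second_has_bom = second.startswith(boms)

# Concatenated and decoded, the second BOM survives as U+FEFF in the text.
joined = (first + second).decode("utf-16")

# An explicit byte order produces no BOM at all.
no_bom = u"Line two".encode("utf-16-le")
```

The stray U+FEFF in joined is the character that forgiving readers silently skip.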


The quick fix to the first problem is to open the file in binary mode:

f = open("testfile.txt", "wb")

This doesn't fix the multiple-BOM problem, but it fixes the broken \n problem. If you want to fix the BOM problem, you either use a stateful encoder, or you explicitly specify 'utf-16-le' (or 'utf-16-be') for all writes but the first write.
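One way the stateful approach might look, using an incremental encoder from the codecs module together with binary mode. This is a sketch only: the function name write_utf16_chunks and the temporary path are assumptions made for the example.

```python
import codecs
import os
import tempfile

def write_utf16_chunks(path, chunks):
    """Write unicode chunks as UTF-16 with a single BOM at the start."""
    # The incremental encoder remembers that it already emitted the BOM.
    encoder = codecs.getincrementalencoder("utf-16")()
    # Binary mode: no newline translation touches the encoded bytes.
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(encoder.encode(chunk))

# Illustrative temporary location for the output file.
path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
write_utf16_chunks(path, [u"Line one", u"\r\n", u"Line two"])

with open(path, "rb") as f:
    data = f.read()
```

Reading the file back in binary mode should show exactly one BOM, followed by the text with an intact \r\n.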


But the easy fix, for both problems, is to use the io module (or, for older Python 2.x, the codecs module) to do all the hard work for you:

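import io

# newline="" turns off newline translation, so u"\r\n" is written unchanged.
f = io.open("testfile.txt", "w", encoding="utf-16", newline="")
f.write(u"Line one")
f.write(u"\r\n")
f.write(u"Line two")
f.close()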
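For the codecs route mentioned above, a comparable sketch might look like the following. codecs.open opens the underlying file in binary mode when an encoding is given, so no newline translation occurs and the BOM is written once. The function name and temporary path are assumptions for illustration, and newer Python 3 releases may prefer the built-in open over codecs.open.

```python
import codecs
import os
import tempfile

def write_with_codecs(path):
    """Write two lines as UTF-16 through a codecs stream writer."""
    f = codecs.open(path, "w", encoding="utf-16")
    try:
        f.write(u"Line one")
        f.write(u"\r\n")
        f.write(u"Line two")
    finally:
        f.close()

# Illustrative temporary location for the output file.
codecs_path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
write_with_codecs(codecs_path)

with open(codecs_path, "rb") as f:
    codecs_data = f.read()
```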
Licensed under: CC-BY-SA with attribution