Creating UTF-16 newline characters in Python for Windows Notepad

Question

The problem is that you're opening the file in text mode, but trying to use it as a binary file.

This:

u"\r\n".encode("utf-16")

… encodes to '\r\0\n\0' (after a two-byte BOM, and assuming little-endian byte order).
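As a quick check of that claim, the sketch below splits the encoded bytes into BOM and payload. The variable names are illustrative only, and the byte order of the plain "utf-16" codec depends on the machine, so the comments hedge on that.

```python
import codecs

# Plain "utf-16" prepends a BOM and uses the machine's native byte order.
encoded = u"\r\n".encode("utf-16")
bom, payload = encoded[:2], encoded[2:]

# The endian-specific codecs never emit a BOM.
le_only = u"\r\n".encode("utf-16-le")   # b'\r\x00\n\x00'
be_only = u"\r\n".encode("utf-16-be")   # b'\x00\r\x00\n'

print(repr(bom), repr(payload))
```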

Then this:

f.write('\r\0\n\0')

… converts the Unix newline to a Windows newline (because the file is open in text mode), giving '\r\0\r\n\0'.

And that, of course, breaks your UTF-16 encoding. The two bytes \r\n now decode as a single code unit, the valid but unassigned codepoint U+0A0D. Worse, the extra byte leaves an odd number of bytes, so you've got a leftover \0 that shifts everything after it. Instead of L\0 being the next character, it's \0L, aka 䰀, and so on.
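The corruption can be reproduced without a file. This is only a sketch: it simulates the text-mode translation with bytes.replace (an assumption made so it behaves the same on any platform) and pins the byte order with 'utf-16-le' so no BOM gets in the way.

```python
# "\r\n" followed by "L", as UTF-16 little-endian without a BOM.
good = u"\r\nL".encode("utf-16-le")

# Simulate what Windows text mode does to the byte stream: every \n byte
# becomes \r\n, regardless of the encoding it belongs to.
mangled = good.replace(b"\n", b"\r\n")

# errors="replace" lets the decoder get past the dangling final byte.
decoded = mangled.decode("utf-16-le", errors="replace")

print(len(good), len(mangled))   # even length before, odd length after
print(repr(decoded))
```

The first three decoded characters are U+000D, U+0A0D and U+4C00 (䰀), which matches the garbage Notepad shows.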

On top of that, you're probably writing a new UTF-16 BOM for each encoded string. Most Windows apps will actually transparently handle that and ignore them, so all you're practically doing is wasting two bytes/line, but it isn't actually correct.

The quick fix to the first problem is to open the file in binary mode:

f = open("testfile.txt", "wb")

This doesn't fix the multiple-BOM problem, but it fixes the broken \n problem. If you want to fix the BOM problem, either use a stateful encoder, or explicitly specify 'utf-16-le' (or 'utf-16-be') for every write except the first.
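A minimal sketch of the stateful-encoder approach, assuming the incremental encoder from the codecs module: it emits the BOM on its first call only. The helper name write_utf16_chunks and the temporary path are made up for illustration.

```python
import codecs
import os
import tempfile


def write_utf16_chunks(path, chunks):
    """Write text chunks as UTF-16 through one encoder, so only one BOM is emitted."""
    encoder = codecs.getincrementalencoder("utf-16")()
    # Binary mode: no newline translation touches the encoded bytes.
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(encoder.encode(chunk))


# Hypothetical scratch location so the example does not clobber a real file.
path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
write_utf16_chunks(path, [u"Line one", u"\r\n", u"Line two"])

with open(path, "rb") as f:
    data = f.read()

print(repr(data[:2]))             # the single BOM
print(repr(data.decode("utf-16")))
```

Encoding each chunk separately with "utf-16" would instead put a BOM in front of every chunk, which is the problem described above.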

But the easy fix, for both problems, is to use the io module (or, for older Python 2.x, the codecs module) to do all the hard work for you:

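import io

# newline="" stops text mode from turning the explicit "\r\n" into "\r\r\n" on Windows
f = io.open("testfile.txt", "w", encoding="utf-16", newline="")
f.write(u"Line one")
f.write(u"\r\n")
f.write(u"Line two")
f.close()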