Python codecs line ending

Question

This is a great question.

It makes a difference whether you open a file with open() or codecs.open(). The former operates in terms of byte strings; the latter operates in terms of Unicode strings. In Python 2, the two types use different rules to decide what counts as a line break.

This same question came up as Python Issue 7643, "What is a Unicode line break character?". The discussion, and the citations to the Unicode Character Database, are fascinating. Issue 7643 also gives this concise code snippet to demonstrate the difference:

# Python 2: unicode.splitlines() on the left, str.splitlines() on the right.
for s in '\x0a\x0d\x1c\x1d\x1e':
  print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)

But it boils down to this.

To determine if bytes in byte strings are line breaks (or whitespace), Python uses the rules of ASCII control characters. By that measure, bytes 10 and 13 are line break characters (and Python treats byte 13 followed by 10 as a single line break).
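
As a quick illustration of the byte-string rule, the sketch below splits a byte string containing bytes 10, 13, 13 followed by 10, and 28. The sample data is made up, and the b'' prefix is only there so that the same lines run on Python 2.7 (where it is a plain str) and on Python 3.

```python
# Byte strings: only LF (10), CR (13), and CR+LF end a line.
# Byte 28 (0x1c) is left alone.
data = b'one\ntwo\rthree\r\nfour\x1cfive'
byte_lines = data.splitlines()
print(byte_lines)
```

The CR+LF pair yields one break rather than two, and the text around byte 28 stays on a single line.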

But to determine if characters in Unicode strings are line breaks, Python follows the character classifications of the Unicode Character Database, documented in UAX #44, and of the UAX #14 Line Breaking Algorithm, section 5 Line Breaking Properties. According to Issue 7643, these documents define three character properties, any one of which marks a character as a line break for Python's purposes:

General Category Zl "Line Separator"
General Category Zp "Paragraph Separator"
Bidirectional Class B "Paragraph Separator"

Characters 28 (0x001C), 29 (0x001D), and 30 (0x001E) have Bidirectional Class B, the third of those properties. Character 31 (0x001F) does not. Why? That's a question for the Unicode Technical Committee. But in ASCII, characters 28 through 31 were known as "File Separator", "Group Separator", "Record Separator", and "Unit Separator" respectively. Using a tabbed text data file as a comparison, the first three connote at least as much separation as a line break does, while the fourth is perhaps analogous to the tab.
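
If you want to check these classifications yourself, the standard unicodedata module exposes both properties. The sketch below is my own illustration, not something taken from Issue 7643. It looks up the General Category and Bidirectional Class of characters 28 to 31, with U+2028 and U+2029 added for comparison, and then asks splitlines() which of them it treats as a break.

```python
import unicodedata

# Code points 28-31 (FS, GS, RS, US), plus U+2028 and U+2029 for comparison.
samples = [u'\x1c', u'\x1d', u'\x1e', u'\x1f', u'\u2028', u'\u2029']

# (code point, General Category, Bidirectional Class) for each sample.
properties = [(ord(ch), unicodedata.category(ch), unicodedata.bidirectional(ch))
              for ch in samples]
for row in properties:
    print(row)

# Which samples does splitlines() on a Unicode string treat as a line break?
breaks = [ord(ch) for ch in samples
          if len((u'a' + ch + u'b').splitlines()) == 2]
print(breaks)
```

Characters 28 to 30 report Bidirectional Class B, and character 31 reports S and is the only sample that does not split the string.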

You can see the code that defines these three characters as line breaks in Python Unicode strings in Objects/unicodeobject.c. Look for the array ascii_linebreak[]. This array underlies the implementation of unicode.splitlines(). Different code underlies str.splitlines(). I believe, but haven't traced it in the Python source code, that enumerate() on a file opened with codecs.open() is implemented in terms of unicode.splitlines().
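
One way to test that belief without reading the source is to feed the same text to splitlines() and to a codecs reader, then compare the results. In this sketch an in-memory io.BytesIO stands in for the file, and codecs.getreader('utf-8') builds what I understand to be the same kind of stream reader that codecs.open() wraps around a real file. The sample bytes are made up.

```python
import codecs
import io

raw = b'ab\x1ccd\nef\n'

# Lines as seen by iterating a codecs stream reader over the bytes.
reader = codecs.getreader('utf-8')(io.BytesIO(raw))
reader_lines = list(reader)

# Lines produced by splitlines() on the decoded text, keeping line ends.
split_lines = raw.decode('utf-8').splitlines(True)

print(reader_lines == split_lines)
print(reader_lines)
```

Both approaches break the first line at character 28, which is consistent with the reader relying on the Unicode splitlines() rules.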

You ask, "how can I prevent it from doing so?" I don't see any way to make splitlines() behave differently. However, you can open the file as a byte stream, read lines as bytes with the str.splitlines() behaviour, then decode each line as UTF-8 for use as a unicode string:

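# Python 2: iterating a file from open() splits on byte-level line ends only.
with open('unicodetest.txt', 'r') as f:
  for i, line in enumerate(f):
    # Decode each byte-string line to a unicode string after splitting.
    print i, line.decode('UTF-8')
# prints "0 abcde" with special characters in between.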
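
A possible alternative, offered as a sketch rather than something I have run against your file, is io.open(), which is available from Python 2.6 and is the built-in open() in Python 3. It decodes for you, and its text layer recognises only \n, \r, and \r\n as line ends. Here io.TextIOWrapper over an in-memory io.BytesIO stands in for io.open('unicodetest.txt', encoding='utf-8'), and the sample bytes are made up.

```python
import io

# Sample bytes: character 28 inside the first line, plus a UTF-8 e-acute.
raw = b'ab\x1ccd\xc3\xa9\nef\n'

# TextIOWrapper is the text layer that io.open(..., encoding='utf-8') returns.
text_file = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8')
text_lines = list(text_file)

print(len(text_lines))
```

The lines come back already decoded, and character 28 stays inside the first line instead of ending it.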

I assume you are using Python 2.x, not 3.x. My answer is based on Python 2.7.