Problem

It seems Python's UTF-8 encoding (codecs package) interprets Unicode characters 28, 29, and 30 as line endings. Why? And how can I prevent it from doing so?

Example code:

with open('unicodetest.txt', 'w') as f:
  f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
with open('unicodetest.txt', 'r') as f:
  for i,l in enumerate(f):
    print i, l
# prints "0 abcde" with special characters in between.

The point here is that it reads it as one line as I expect it to do. Now when I use codecs to read it in UTF-8, it interprets it as many lines.

import codecs
with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
  for i,l in enumerate(f):
    print i, l
# 0 a
# 1 b
# 2 c
# 3 de
# (again with the special characters after each of a, b, c, d)

The characters 28 through 31 are described as "Information Separator Four" through "One" (in that order). Two things strike me: 1) 28 to 30 are interpreted as line ends, 2) 31 is not. Is this intended behaviour? Where can I find a definition of which characters are interpreted as line ends? Is there a way to not interpret them as line ends?

Thanks.

Edit: I forgot to copy the 'UTF-8' argument in codecs.open. The code in my question is now corrected.


Solution

This is a great question.

It makes a difference whether you open a file with open() or codecs.open(). The former operates in terms of byte strings. The latter operates in terms of Unicode strings. In Python, these behave differently.

This same question came up as Python Issue 7643, "What is a Unicode line break character?". The discussion, and the citations to the Unicode Character Database, are fascinating. Issue 7643 also gives this concise code snippet to demonstrate the difference:

for s in '\x0a\x0d\x1c\x1d\x1e':
  print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)

But it boils down to this.

To determine if bytes in byte strings are line breaks (or whitespace), Python uses the rules of ASCII control characters. By that measure, bytes 10 and 13 are line break characters (and Python treats byte 13 followed by 10 as a single line break).
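As a small illustration of the byte-string rule (a sketch of mine, not taken from Issue 7643; the sample data is made up), splitting a byte string only breaks at bytes 10 and 13, and leaves bytes 28 to 30 alone:

```python
# Hypothetical sample data; b'' literals behave the same way on Python 2.7 and 3.x.
data = b'a\nb\rc\r\nd\x1ce\x1df\x1eg'

# In a byte string only LF (10), CR (13) and the CR LF pair end a line,
# so the separators 28-30 stay inside the last piece.
byte_lines = data.splitlines()
print(byte_lines)
```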

But to determine if characters in Unicode strings are line breaks, Python follows the character classifications of the Unicode Character Database, documented at UAX #44, and of the UAX #14 Line Breaking Algorithm, section 5 Line Breaking Properties. According to Issue 7643, these documents identify three character properties which identify a character as a linebreak for Python's purposes:

  • General Category Zl "Line Separator"
  • General Category Zp "Paragraph Separator"
  • Bidirectional Class B "Paragraph Separator"

Characters 28 (0x001C), 29 (0x001D), and 30 (0x001E) have the third of those properties: their Bidirectional Class is B. Character 31 (0x001F) has Bidirectional Class S ("Segment Separator") instead, so it does not qualify. Why? That's a question for the Unicode Technical Committee. But in ASCII, these characters were known as "File Separator", "Group Separator", "Record Separator", and "Unit Separator". Using a tabbed text data file as a comparison, the first three connote at least as much separation as a line break does, while the fourth is perhaps analogous to the tab.
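If you want to check these properties yourself, the standard unicodedata module exposes them. This is only a sketch: the helper name describe() is my own invention, and the values you get depend on the Unicode Character Database version bundled with your interpreter.

```python
import unicodedata

# Control characters 28-31, plus the two dedicated Unicode separators
# U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR).
candidates = u'\x1c\x1d\x1e\x1f\u2028\u2029'

def describe(ch):
    """Hypothetical helper: (code point, General Category, Bidirectional Class)."""
    return ord(ch), unicodedata.category(ch), unicodedata.bidirectional(ch)

properties = [describe(ch) for ch in candidates]
for code_point, category, bidi in properties:
    print('U+%04X  category=%s  bidi=%s' % (code_point, category, bidi))
```

On the versions I know of, 28 to 30 report bidi class B and 31 reports S, which matches the split seen in the question.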

You can see the code which actually defines these three Unicode characters as being line breaks in Python Unicode strings in Objects/unicodeobject.c. Look for array ascii_linebreak[]. This array underlies the implementation of unicode.splitlines(). Different code underlies str.splitlines(). I believe, but haven't traced it in the Python source code, that iterating over a file opened with codecs.open() (which is what enumerate() does here) is implemented in terms of unicode.splitlines().
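One way to test that belief without touching the disk is to wrap an in-memory byte stream in the codec's stream reader. This is a sketch under the assumption that codecs.open() hands back a wrapper around the same kind of reader that codecs.getreader() gives you; the variable names are mine.

```python
import codecs
import io

# The same nine characters as in the question, encoded to UTF-8 bytes.
raw = io.BytesIO(u'a\x1cb\x1dc\x1ed\x1fe'.encode('utf-8'))

# codecs.getreader() returns the StreamReader class for the named codec;
# iterating over an instance calls its readline() repeatedly.
reader = codecs.getreader('utf-8')(raw)
codec_lines = list(reader)
print(codec_lines)
```

If the lines come back split after characters 28, 29 and 30 but not after 31, the reader is applying the Unicode line break rules rather than the byte-string ones.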

You ask, "how can I prevent it from doing so?" I don't see any way to make splitlines() behave differently. However, you can open the file as a byte stream, read lines as bytes with the str.splitlines() behaviour, then decode each line as UTF-8 for use as a unicode string:

with open('unicodetest.txt', 'r') as f:
  for i,l in enumerate(f):
    print i, l.decode('UTF-8')
# prints "0 abcde" with special characters in between.
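Another option, which is my own suggestion rather than something from Issue 7643, is io.open() (it exists in Python 2.7, and is the built-in open() in 3.x). As far as I know its text layer decodes for you but only treats \n, \r and \r\n as line ends. A sketch, using a throwaway temporary directory and a hypothetical helper read_lines_utf8():

```python
import io
import os
import shutil
import tempfile

def read_lines_utf8(path):
    """Hypothetical helper: decoded lines of a UTF-8 file, as split by io.open()."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return list(f)

# Write the question's test data into a temporary file, read it back, clean up.
tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, 'unicodetest.txt')
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'a\x1cb\x1dc\x1ed\x1fe')
    io_lines = read_lines_utf8(path)
finally:
    shutil.rmtree(tmp_dir)

print(len(io_lines))
```

Here the whole content should come back as one line, already decoded, so no per-line decode() call is needed.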

I assume you are using Python 2.x, not 3.x. My answer is based on Python 2.7.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow