Python: getting correct string length when it contains surrogate pairs

https://stackoverflow.com/questions/12907022

07-07-2021

Question

Consider the following exchange on IPython:

In [1]: s = u'華袞與緼𦅷同歸'

In [2]: len(s)
Out[2]: 8

The correct output should be 7, but the fifth of these seven Chinese characters has a code point above U+FFFF, so this (narrow) Python build stores it internally as a UTF-16 "surrogate pair" rather than as a single code unit. As a result, Python counts it as two characters rather than one.

Even if I use unicodedata, which displays the surrogate pair correctly as a single code point (\U00026177), len() still returns the wrong length for the result:

In [3]: import unicodedata

In [4]: unicodedata.normalize('NFC', s)
Out[4]: u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'


In [5]: len(unicodedata.normalize('NFC', s))
Out[5]: 8

Without taking drastic steps like recompiling Python for UTF-32, is there a simple way to get the correct length in situations like this?

I'm on IPython 0.13, Python 2.7.2, Mac OS 10.8.2.

Solution

I think this has been fixed in Python 3.3, where PEP 393 removed the narrow/wide build distinction and len() counts code points. See:

http://docs.python.org/py3k/whatsnew/3.3.html
http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)
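
As a quick check of that claim, here is a minimal sketch that assumes it is run on Python 3.3 or later. It rebuilds the string from the escapes shown in Out[4]; the variable names are illustrative only.

```python
# The question's string, written with the escapes from Out[4].
# U+26177 lies outside the Basic Multilingual Plane.
s = "\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78"

# On Python 3.3+, len() counts code points.
print(len(s))  # 7

# For comparison: the number of UTF-16 code units, which is what a
# narrow Python 2 build reports for the same text.
utf16_units = len(s.encode("utf-16-le")) // 2
print(utf16_units)  # 8
```

The second number is the 8 from the question: six characters that fit in one UTF-16 code unit each, plus one that needs two.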

OTHER TIPS

I wrote a function to do this on Python 2:

import re

# Matches one high surrogate followed by one low surrogate.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
def unicodeLen(s):
  return len(SURROGATE_PAIR.sub(u'.', s))

By replacing each surrogate pair with a single character, we 'fix' the len function. On strings without surrogate pairs this should be pretty efficient: since the pattern won't match, the original string is returned without modification. It should work on wide (UCS-4) Python builds, too, as they do not use the surrogate pair encoding.
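
To see the substitution at work without depending on a narrow build, here is a small sketch that writes the surrogate pair for U+26177 out explicitly (D858 DD77). The name code_point_len and the test strings are illustrative assumptions, not part of the answer above.

```python
import re

# One high surrogate followed by one low surrogate.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]')

def code_point_len(s):
    """Length of s with each surrogate pair counted once."""
    return len(SURROGATE_PAIR.sub(u'.', s))

# The question's string, with U+26177 spelled as its UTF-16 surrogate pair.
with_pair = u'\u83ef\u889e\u8207\u7dfc\ud858\udd77\u540c\u6b78'
# A string with no surrogates: the pattern never matches.
bmp_only = u'\u83ef\u889e\u8207'

print(code_point_len(with_pair))  # 7
print(code_point_len(bmp_only))   # 3
```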

You can shadow the built-in len function in Python (see: How does len work?) and add an if statement to it that checks for surrogate pairs.
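
One way that suggestion might look is sketched below. The helper name unicode_aware_len is hypothetical, and the surrogate check reuses the regular expression idea from the previous answer; treat it as an illustration rather than a recommended replacement for len.

```python
import re

_SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]')
_TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3

def unicode_aware_len(obj):
    """Like len(), but counts each surrogate pair in text as one character."""
    n = len(obj)
    if isinstance(obj, _TEXT_TYPE):
        # The extra "if": subtract one unit for every surrogate pair found.
        n -= len(_SURROGATE_PAIR.findall(obj))
    return n

print(unicode_aware_len(u'a\ud858\udd77b'))  # 3
print(unicode_aware_len([1, 2, 3]))          # 3
```

Binding this function to the name len inside a module would shadow the built-in there. In that case the helper would have to reach the original through the builtins module (__builtin__ on Python 2) to avoid calling itself.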

Licensed under: CC-BY-SA with attribution
