Properly split unicode string on byte count [duplicate]

https://stackoverflow.com/questions/23413292

13-07-2023

Question

I want to truncate a unicode string so that its UTF-8 encoding is at most 255 bytes long, and return the result as unicode:

# s = arbitrary-length-unicode-string
s.encode('utf-8')[:255].decode('utf-8')

The problem with this snippet is that if the cut falls in the middle of a multi-byte character (for example, the 255th byte is the first byte of a 2-byte character), I get an error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 254: unexpected end of data

Even if I handle the error, I'll get unwanted garbage at the end of the string.

How can I solve this more elegantly?

Solution

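One very nice property of UTF-8 is that trailing (continuation) bytes can easily be told apart from starting bytes: every continuation byte has the bit pattern 10xxxxxx, so byte & 0xc0 == 0x80. Take one byte more than the limit, then work backwards from the end until you've deleted a starting byte.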

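# bytearray indexing yields ints on both Python 2 and 3, so no ord() is needed.
trunc_s = bytearray(s.encode('utf-8')[:256])
if len(trunc_s) > 255:
    # A 256th byte exists, so the string must be cut. Step back over
    # continuation bytes (10xxxxxx) to the starting byte of that character.
    final = -1
    while trunc_s[final] & 0xc0 == 0x80:
        final -= 1
    # Drop that starting byte and everything after it.
    trunc_s = trunc_s[:final]
trunc_s = trunc_s.decode('utf-8')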
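
The same backward scan can be wrapped in a reusable function. This is only a sketch, assuming Python 3 (where indexing a bytes object yields an int); the names truncate_utf8 and max_bytes are made up for illustration and do not come from the original answer.

```python
def truncate_utf8(text, max_bytes=255):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text  # already short enough, nothing to cut

    cut = max_bytes
    # encoded[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a character, so
    # move back until the cut sits on that character's starting byte.
    while cut > 0 and encoded[cut] & 0xC0 == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")


# Usage: 'д' encodes to 2 bytes, so only 127 whole characters fit in 255 bytes.
short = truncate_utf8("д" * 200)
print(len(short), len(short.encode("utf-8")))
```

Checking the byte at index max_bytes, rather than the last byte that was kept, avoids throwing away a complete character that happens to end exactly at the limit.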

Edit: Check out the answers in the question identified as a duplicate, too.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow