Is there an easy way to get a substring of a UTF-8 encoded string whose repr has length at most N in Python?

https://stackoverflow.com/questions/16264822

13-04-2022

Question

For example, I have a string and I am looking for an easy way to get a substring that, once encoded in UTF-8, has a repr of length <= N. Of course, I could try a substring of length N/3 and then increase to N/3+1, N/3+2, ..., but is there an easier way?

word = u"this is a ship, and some other words".encode("utf-8")
# some way to get a substring
substring = func(word, N)
# desired property: assert len(repr(substring)) <= N

Thanks!

Solution

A possible approach:

1. Take the first N-1 bytes of the repr of the whole string.
2. Examine the last 3 bytes to see whether you broke an escape sequence, and cut off bytes if necessary.
3. Append a closing quote, keeping in mind that it may be ' or ".
4. Eval the repr back to a UTF-8 byte string.
5. Examine the last few bytes to see whether you broke the string in the middle of a Unicode code point, and cut off bytes if necessary. You can tell leading bytes and continuation bytes apart by examining the bit pattern: continuation bytes always have the form 10xxxxxx.
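
The steps above can be sketched roughly as follows. This is only one possible reading of the approach, under some assumptions: it targets Python 3, where the repr of a bytes object starts with a b prefix (in Python 2 the repr has no prefix, so the counts shift by one), and the names truncate_utf8_by_repr and trim_partial_utf8 are hypothetical helpers, not part of any library. Instead of inspecting only the last 3 bytes, the sketch walks the escape sequences from the start, which is a little slower but easier to get right.

```python
import ast


def trim_partial_utf8(raw: bytes) -> bytes:
    """Drop a trailing, incomplete UTF-8 sequence (assumes valid UTF-8 before it)."""
    i = len(raw)
    # Step back over up to 3 trailing continuation bytes (bit pattern 10xxxxxx).
    while i > 0 and len(raw) - i < 3 and (raw[i - 1] & 0xC0) == 0x80:
        i -= 1
    if i == 0:
        return raw
    lead = raw[i - 1]
    if lead >> 5 == 0b110:
        need = 2          # 110xxxxx starts a 2-byte sequence
    elif lead >> 4 == 0b1110:
        need = 3          # 1110xxxx starts a 3-byte sequence
    elif lead >> 3 == 0b11110:
        need = 4          # 11110xxx starts a 4-byte sequence
    else:
        need = 1          # ASCII byte
    have = len(raw) - (i - 1)
    return raw if have >= need else raw[: i - 1]


def truncate_utf8_by_repr(data: bytes, n: int) -> bytes:
    """Return a prefix of UTF-8 bytes `data` with len(repr(prefix)) <= n."""
    full = repr(data)                 # e.g. b'caf\xc3\xa9'
    if len(full) <= n:
        return data
    if n < 3:
        raise ValueError("repr of an empty bytes object already needs 3 characters")
    quote = full[1]                   # may be ' or "
    body = full[2 : n - 1]            # keep n-1 characters, leaving room for the closing quote

    # Walk the body token by token and stop before a cut-off escape sequence.
    # The repr of bytes only uses \xNN (4 characters) and two-character escapes.
    i = 0
    while i < len(body):
        if body[i] == "\\":
            width = 4 if body[i + 1 : i + 2] == "x" else 2
        else:
            width = 1
        if i + width > len(body):
            break
        i += width
    body = body[:i]

    # Re-append the quote and evaluate the literal back to bytes.
    raw = ast.literal_eval("b" + quote + body + quote)
    return trim_partial_utf8(raw)


word = u"this is a ship, and some other words".encode("utf-8")
substring = truncate_utf8_by_repr(word, 10)
print(repr(substring))                # b'this is'
assert len(repr(substring)) <= 10
```

Since every non-ASCII byte costs four characters in the repr, the result can be much shorter than N bytes for text that is mostly non-ASCII.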

Licensed under: CC-BY-SA with attribution
