Is there an easy way to get a substring of a UTF-8 encoded string such that the length of the substring's repr is less than N in Python?

StackOverflow https://stackoverflow.com/questions/16264822

Question

For example, I have a string, and I hope to find an easy way to get a substring, encoded in UTF-8, such that the length of the repr of the substring is <= N. Of course I can try a substring of length N/3 and then increase it to N/3+1, N/3+2, ..., but is there an easier way?

word = u"this is a ship, and some other words".encode("utf-8")
# some way to get a substring (func is the function I am looking for)
substring = func(word, N)
# assert len(repr(substring)) <= N

Thanks!


Solution

A possible approach:

  1. Take the first N-1 bytes of the repr of the whole string.
  2. Examine the last 3 bytes to see if you broke an escape sequence, and cut off bytes if necessary.
  3. Append a quote, keeping in mind that it may be ' or ".
  4. Eval the repr back to UTF-8.
  5. Examine the last few bytes to see if you broke the string in the middle of a Unicode code point, and cut off bytes if necessary. You can tell leading bytes and continuation bytes apart by examining the bit pattern.
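
The steps above can be sketched roughly as follows. This is only an illustration, not a tested library routine. It assumes Python 3, where the repr of a bytes object starts with b (so the shortest possible repr is b'', 3 characters), and it assumes the input is valid UTF-8. The names truncate_for_repr and trim_partial_utf8 are made up for this example, and instead of looking only at the last 3 characters it scans the escapes from the front, which avoids misreading an escaped backslash.

```python
import ast


def trim_partial_utf8(data: bytes) -> bytes:
    """Step 5: drop a trailing, incomplete UTF-8 sequence."""
    i = len(data)
    # Continuation bytes look like 10xxxxxx; walk back to the lead byte.
    while i > 0 and (data[i - 1] & 0xC0) == 0x80:
        i -= 1
    if i == 0:
        return b""  # empty input, or nothing but continuation bytes
    lead = data[i - 1]
    if lead >> 5 == 0b110:
        needed = 2
    elif lead >> 4 == 0b1110:
        needed = 3
    elif lead >> 3 == 0b11110:
        needed = 4
    else:
        needed = 1  # ASCII byte
    have = len(data) - (i - 1)
    return data if have >= needed else data[: i - 1]


def truncate_for_repr(data: bytes, limit: int) -> bytes:
    """Return a prefix of data whose repr has at most `limit` characters."""
    if limit < 3:
        raise ValueError("the repr of bytes is at least 3 characters: b''")
    full = repr(data)
    if len(full) <= limit:
        return data
    quote = full[-1]          # repr may have chosen ' or "
    body = full[: limit - 1]  # step 1: leave room for the closing quote

    # Step 2: walk the escapes after the b' prefix and stop before a
    # broken one. Escapes are two characters (\\, \', \n, ...) or four (\xNN).
    i = 2
    while i < len(body):
        if body[i] == "\\":
            width = 4 if body[i + 1 : i + 2] == "x" else 2
            if i + width > len(body):
                break
            i += width
        else:
            i += 1
    body = body[:i]

    # Steps 3 and 4: close the literal and evaluate it back to bytes.
    prefix = ast.literal_eval(body + quote)
    return trim_partial_utf8(prefix)
```

With the word from the question, truncate_for_repr(word, 10) should give b'this is', whose repr is exactly 10 characters long. On Python 2, where the repr of a str has no b prefix, the offsets above would need adjusting.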
Licensed under: CC-BY-SA with attribution