The problem is that the string in text is a Python 2 byte string that happens to contain UTF-8-encoded data. Offsets inside such a string are byte offsets, which correspond to character offsets only when the data is all-ASCII. The offsets used by get_iter_at_offset, on the other hand, are always character offsets.
A quick fix for this issue is to convert the text to Unicode, e.g. with:
text = text.decode('utf-8')
Then re.finditer reports character offsets as well, and the program displays the expected output.
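To see the mismatch in isolation, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not taken from the question. It runs the same search on the raw bytes and on the decoded text and collects the match offsets from each:

```python
import re

# UTF-8 bytes for "héllo wörld" (hypothetical sample data), written with
# escapes so the snippet behaves the same on Python 2 and Python 3.
raw = b'h\xc3\xa9llo w\xc3\xb6rld'

# Decoding gives a Unicode string, where each index is one character.
decoded = raw.decode('utf-8')

# Searching the byte string yields byte offsets: the two-byte "é" before
# the match pushes the offset of "w" up by one.
byte_offsets = [m.start() for m in re.finditer(b'w', raw)]

# Searching the decoded string yields character offsets, which is what
# get_iter_at_offset expects.
char_offsets = [m.start() for m in re.finditer(u'w', decoded)]

print(byte_offsets)  # [7]
print(char_offsets)  # [6]
```

Every non-ASCII character before a match widens the gap between the two offsets, which is why the problem only shows up once the text contains such characters.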