Why is the character offset shifted in gtk.TextBuffer?

https://stackoverflow.com/questions/22327139

12-06-2023

Question

I have a small test application (please run it in a terminal):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
import time
import gtk


text = '''Python – język programowania wysokiego poziomu ogólnego przeznaczenia[2] i rozbudowanym pakiecie bibliotek standardowych[3], którego ideą przewodnią jest czytelność i klarowność kodu źródłowego. Jego składnia cechuje się przejrzystością i zwięzłością[4][5].

Python wspiera różne paradygmaty programowania: obiektowy, imperatywny oraz w mniejszym stopniu funkcyjny. Posiada w pełni dynamiczny system typów i automatyczne zarządzanie pamięcią, będąc w tym podobnym do języków Perl, Ruby, Scheme czy Tcl. Podobnie jak inne języki dynamiczne jest często używany jako język skryptowy. Interpretery Pythona są dostępne na wiele systemów operacyjnych.'''

main_window = gtk.Window(gtk.WINDOW_TOPLEVEL)
main_window.set_default_size(640, 480)
main_window.connect('destroy', lambda a: gtk.main_quit())
text_buffer = gtk.TextBuffer()
text_buffer.set_text(text)
text_view = gtk.TextView(text_buffer)
text_view.set_wrap_mode(gtk.WRAP_WORD)
main_window.add(text_view)
main_window.show_all()

for m in re.finditer('Python', text):
    start_iter = text_buffer.get_iter_at_offset(m.start())
    end_iter = text_buffer.get_iter_at_offset(m.end())
    t = text_buffer.get_text(start_iter, end_iter)
    print('This string should == Python', t)

gtk.main()

which demonstrates my problem. In this application I search for a string with a regular expression, and then I want to select that string in the GtkTextView. Unfortunately, the character offsets of the match from the MatchObject do not match the character offsets in the GtkTextBuffer. Why is that, and how can I fix it?

Solution

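The problem is that the string in text is a Python 2 byte string that happens to contain UTF-8-encoded data. Offsets into such a string are byte offsets, which correspond to character offsets only when the data is all ASCII. Every non-ASCII character such as ę or ó occupies more than one byte in UTF-8, so each one that precedes a match pushes the byte offset further ahead of the character offset. The offsets used by get_iter_at_offset, on the other hand, are always character offsets.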
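The difference between the two kinds of offset can be seen without GTK. The following is a minimal sketch, not part of the original program: the short sample string is made up for illustration, and the u'' and b'' prefixes are used so that the same lines should behave identically on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Illustrative sample (an assumption, not the text from the question).
text = u'język Python – Python'

# The UTF-8 bytes are what a plain Python 2 string literal actually holds.
encoded = text.encode('utf-8')

# Searching the Unicode string yields character offsets.
char_offsets = [m.start() for m in re.finditer(u'Python', text)]

# Searching the byte string yields byte offsets.
byte_offsets = [m.start() for m in re.finditer(b'Python', encoded)]

print(char_offsets)  # [6, 15]
print(byte_offsets)  # [7, 18]
```

The first match is off by one because ę takes two bytes, and the second is off by three because the dash adds another two extra bytes. In the question's text the first occurrence of Python sits at offset 0, so only the later match is visibly shifted.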

A quick fix is to convert the text to Unicode before searching, e.g. with:

text = text.decode('utf-8')

Then re.finditer reports character offsets as well, and the program displays the expected output.
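As a rough sketch of how this could be packaged, the two helpers below decode first and then search, or convert a byte offset that has already been computed. The names find_char_spans and byte_to_char_offset are hypothetical and do not come from GTK or the standard library, and the sample string is again made up for illustration.

```python
# -*- coding: utf-8 -*-
import re


def find_char_spans(raw, pattern, encoding='utf-8'):
    """Decode raw bytes, then return (start, end) character offsets per match."""
    decoded = raw.decode(encoding)
    return [m.span() for m in re.finditer(pattern, decoded)]


def byte_to_char_offset(raw, byte_offset, encoding='utf-8'):
    """Convert an offset into raw bytes to a character offset.

    Assumes byte_offset falls on a character boundary; otherwise
    decoding the prefix raises UnicodeDecodeError.
    """
    return len(raw[:byte_offset].decode(encoding))


# Illustrative byte string standing in for the question's text variable.
raw = u'język Python – Python'.encode('utf-8')

spans = find_char_spans(raw, u'Python')
converted = byte_to_char_offset(raw, 18)

print(spans)      # [(6, 12), (15, 21)]
print(converted)  # 15
```

Each (start, end) pair is a character offset, so it is the kind of value that get_iter_at_offset expects. Decoding once up front is usually the simpler choice; converting individual byte offsets is mainly useful when the search on the byte string cannot be changed.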

Licensed under: CC-BY-SA with attribution
