How to implement a Unicode buffer in python

https://stackoverflow.com/questions/17906746

04-06-2022
|

题

After an initial search on this, I'm bit lost.

I want to use a buffer object to hold a sequence of Unicode code points. I just need to scan and extract tokens from said sequence, so basically this is a read only buffer, and we need functionality to advance a pointer within the buffer, and to extract sub-segments. The buffer object should of course support the usual regex and search ops on strings.

An ordinary Unicode string can be used for this, but the issue would be the creating of sub-string copies to simulate advancing a pointer within the buffer. This seems to be very inefficient esp for larger buffers, unless there's some workaround.

I can see that there's a Memoryview object that would be suitable, but it does not support Unicode (?).

What else can I use to provide the above functionality? (Whether in Py2 or Py3).

解决方案

It depends on what exactly is needed, but usually just one Unicode string is enough. If you need to take non-tiny slices, you can keep them as 3-tuples (big unicode, start pos, end pos) or just make custom objects with these 3 attributes and whatever API is needed. The point is that a lot of methods like unicode.find() or the regex pattern objects's search() support specifying start and end points. So you can do most basic things without actually needing to slice the single big unicode string.

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow