After an initial search on this, I'm a bit lost.

I want to use a buffer object to hold a sequence of Unicode code points. I only need to scan the sequence and extract tokens from it, so this is essentially a read-only buffer: I need to advance a pointer within the buffer and to extract sub-segments. The buffer object should also support the usual regex and search operations available on strings.

An ordinary Unicode string can be used for this, but the issue is that simulating an advancing pointer means creating sub-string copies. This seems very inefficient, especially for larger buffers, unless there's some workaround.

I can see that there's a memoryview object that would be suitable, but it does not support Unicode (?).

What else can I use to provide the above functionality? (Whether in Py2 or Py3).


Solution

It depends on exactly what is needed, but usually a single Unicode string is enough. If you need to take non-tiny slices, you can keep them as 3-tuples (big unicode, start pos, end pos), or make custom objects with these three attributes and whatever API is needed. The point is that many methods, such as unicode.find() or a regex pattern object's search(), accept start and end positions. So you can do most basic things without ever slicing the single big Unicode string.
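As a rough illustration of that approach, here is a minimal sketch of a cursor object that keeps the one big string plus start and end positions, and passes those positions to find() and to a compiled pattern's match(). The Scanner class and the TOKEN pattern are hypothetical names made up for this example, and the token grammar (runs of word characters, or single punctuation characters) is only an assumption.

```python
import re

# Hypothetical token pattern: skip whitespace, then capture either a run of
# word characters or one punctuation character.
TOKEN = re.compile(r"\s*(\w+|[^\w\s])", re.UNICODE)


class Scanner(object):
    """Read-only cursor over one big string (illustrative helper only)."""

    def __init__(self, text, pos=0, end=None):
        self.text = text
        self.pos = pos
        self.end = len(text) if end is None else end

    def find(self, needle):
        # str.find accepts start/end, so the big string is never sliced.
        return self.text.find(needle, self.pos, self.end)

    def next_token(self):
        # pattern.match(string, pos, endpos) anchors the match at pos.
        m = TOKEN.match(self.text, self.pos, self.end)
        if m is None:
            return None
        self.pos = m.end()   # advance the cursor; no copy is made
        return m.span(1)     # (start, end) offsets of the token

    def tokens(self):
        while True:
            span = self.next_token()
            if span is None:
                return
            yield span


text = u"foo = bar(42)"
scanner = Scanner(text)
first_bar = scanner.find(u"bar")   # offset into the original string
spans = list(scanner.tokens())     # list of (start, end) pairs
words = [text[s:e] for s, e in spans]  # small slices only when needed
```

Only the final list comprehension copies anything, and those copies are token-sized. One caveat: when a position is passed to match() or search(), "^" still refers to the real start of the string rather than to that position, so patterns written for sliced substrings may need adjusting.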

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow