Question

I'm looking for a way to run a regex over a (long) iterable of "characters" in Python. (Python doesn't actually have characters, so it's actually an iterable of one-length strings. But same difference.)

The re module only allows searching over strings (or buffers), as far as I can tell.

I could implement it myself, but that seems a little silly.

Alternatively, I could convert the iterable to a string and run the regex over the string, but that gets (hideously) inefficient. (A worst-case example: re.search(".a", "".join('a' for a in range(10**8))) peaks at over 900M of RAM (private working set) on my (x64) machine, and takes ~12 seconds, even though it only needs to look at the first two characters in the iterable.)

Was it helpful?

Solution

As far as I understand, the example that joins a lot of 'a's is just extremely simple example that shows the problem. In other words, the construction of the content (generally) can be more time and memory consuming than the search itself.

The problem with the standard re module is that it uses the extended regular expression syntax, and it requires backtracking.

You may be interested in the very classic implementation by Thomson (NFA) -- see http://swtch.com/~rsc/regexp/regexp1.html for the explanation and the comparison of performance with the libraries that implement the extended syntax.

It seems that the re2 project can be useful for you. There should be the Python port -- see Is it possible to use re2 from Python? However, I do not know if it supports streaming and wherher any streaming regular expression engine for Python exists.

For understanding the Thomsons idea, you can also try the on-line visualization of the Regular Expression to NFA.

OTHER TIPS

If the number of elements in that list is truly to the order of 10**8 then you are probably better off doing a linear search if you only want to do it once. Otherwise, you have to create this huge string that is really very inefficient. The other thing I can think of if you need to do this more than once is inserting the collection into a hashtable and do the search faster.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top