As far as I understand, the example that joins a lot of 'a's is just an extremely simple example that shows the problem. In other words, constructing the content can (in general) take more time and memory than the search itself.
The problem with the standard re module is that it supports the extended regular expression syntax, and its implementation relies on backtracking.
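Here is a minimal sketch of how the backtracking cost can be observed. It assumes the 'a' example is similar to the classic pathological case of matching a?a?...a?aa...a against a string of n 'a's; the helper names pathological and time_match are made up for this illustration.

```python
import re
import time


def pathological(n):
    """Build the pattern a?^n a^n and the text a^n (hypothetical helper)."""
    return "a?" * n + "a" * n, "a" * n


def time_match(n):
    """Return (matched, seconds) for one full match with the standard re module."""
    pattern, text = pathological(n)
    start = time.perf_counter()
    matched = re.fullmatch(pattern, text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: with a backtracking engine the time tends to grow
    # very quickly as n increases.
    for n in (10, 14, 18):
        matched, seconds = time_match(n)
        print(n, matched, round(seconds, 4))
```

The match always succeeds; only the time changes. The exact numbers depend on the machine and the Python version, so none are given here.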
You may be interested in the classic implementation by Thompson (NFA) -- see http://swtch.com/~rsc/regexp/regexp1.html for the explanation and for a performance comparison with the libraries that implement the extended syntax.
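To give a feeling for the idea, here is a rough sketch of simulating an NFA by tracking the set of all active states at once, instead of trying one path and backtracking. It is not Thompson's full construction: as a simplifying assumption it only handles patterns that are a sequence of single characters, each optionally followed by ?. The function name nfa_fullmatch and the (char, optional) pair representation are invented for this example.

```python
def nfa_fullmatch(items, text):
    """Match text against a sequence of (char, optional) items.

    [('a', True), ('a', False)] stands for the pattern 'a?a'.
    State i means "the first i items are done"; state len(items) accepts.
    """
    accept = len(items)

    def closure(states):
        # Follow epsilon edges: an optional item can be skipped without input.
        result = set(states)
        stack = list(states)
        while stack:
            i = stack.pop()
            if i < accept and items[i][1] and i + 1 not in result:
                result.add(i + 1)
                stack.append(i + 1)
        return result

    current = closure({0})
    for ch in text:
        # Advance every active state that can consume ch, all in one step.
        moved = {i + 1 for i in current if i < accept and items[i][0] == ch}
        current = closure(moved)
        if not current:
            return False  # no state survived, so no match is possible
    return accept in current


if __name__ == "__main__":
    n = 30
    items = [("a", True)] * n + [("a", False)] * n  # a?^n a^n
    print(nfa_fullmatch(items, "a" * n))
```

Each input character is processed once, and the set never holds more than len(items) + 1 states, so the work per character is bounded by the pattern size and there is no exponential blow-up.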
It seems that the re2 project could be useful for you. There should be a Python port -- see the question "Is it possible to use re2 from Python?" However, I do not know whether it supports streaming, or whether any streaming regular expression engine for Python exists.
To understand Thompson's idea, you can also try the online visualization of the regular expression to NFA conversion.