Is there any streaming regular-expression module for Python?

Question 1

As far as I understand, the example that joins a lot of 'a's is just extremely simple example that shows the problem. In other words, the construction of the content (generally) can be more time and memory consuming than the search itself.

The problem with the standard re module is that it uses the extended regular expression syntax, and it requires backtracking.

You may be interested in the very classic implementation by Thomson (NFA) -- see http://swtch.com/~rsc/regexp/regexp1.html for the explanation and the comparison of performance with the libraries that implement the extended syntax.

It seems that the re2 project can be useful for you. There should be the Python port -- see Is it possible to use re2 from Python? However, I do not know if it supports streaming and wherher any streaming regular expression engine for Python exists.

For understanding the Thomsons idea, you can also try the on-line visualization of the Regular Expression to NFA.

Question 2

If the number of elements in that list is truly to the order of 10**8 then you are probably better off doing a linear search if you only want to do it once. Otherwise, you have to create this huge string that is really very inefficient. The other thing I can think of if you need to do this more than once is inserting the collection into a hashtable and do the search faster.