As Joran Beasley's answer shows, using plain string functions will be much faster than using regexps for static string lookups.
But testing `startswith` N times could itself be a huge slowdown if N is very large and matches are uncommon. Also, since you were using `finditer` rather than `findall`, I suspect you may be worried about that case.
You can get the best of both worlds by using `str.find`. Of course ultimately this is doing the same work as calling `startswith` at each point, but it's doing that work in C, zipping over long stretches with no matches as much as 20x faster.
On the other hand, there's no way to wrap this repeated `find` up in a trivial looping expression. (Unless you build a complicated wrapper using `iter` around a closure, but I doubt that would actually help anything.) So, the code will look more complicated than Joran's listcomp. And it could be slower when matches are very common (because in that case you're spending most of your time in the loop, and an explicit loop statement is slower than a comprehension).
On the plus side, the extra verbosity means it's more obvious how to customize it. For example, if you decide you want to skip overlapped matches, just do `i += len(pattern)` instead of `i += 1`.
```python
def finditer(text, pattern):
    i = 0
    while True:
        i = text.find(pattern, i)
        if i == -1: return
        yield i
        i += 1
```
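To make the customization point concrete, here's the non-overlapping variant mentioned above, with only that one line changed (the name `finditer_nonoverlapping` is made up for illustration):

```python
def finditer_nonoverlapping(text, pattern):
    # Same repeated-find generator, but advancing past the whole match
    # skips overlapped occurrences (the way re.finditer behaves).
    i = 0
    while True:
        i = text.find(pattern, i)
        if i == -1: return
        yield i
        i += len(pattern)
```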
From a quick test (under 64-bit Apple CPython 2.7.5):
```
In [931]: pattern = 'ha'

In [932]: text = 'hahahaha'

In [933]: %timeit [i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
100000 loops, best of 3: 2.69 µs per loop

In [934]: %timeit list(finditer(text, pattern))
100000 loops, best of 3: 3.56 µs per loop

In [935]: text = ('hahahaha' + string.ascii_lowercase + 'ha')*100
          pattern = 'ha'

In [936]: %timeit [i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
100000 loops, best of 3: 1.74 ms per loop

In [937]: %timeit list(finditer(text, pattern))
100000 loops, best of 3: 180 µs per loop
```
So, it's almost as fast as Joran's code even for a very short string with 50% matches; for a much longer string with 11% matches, it's already 9.6x faster. If matches are even less common, or if we don't actually need a list, obviously it will win even bigger.
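As a quick sanity check that the speedup isn't coming from computing something different, here's the generator restated so the snippet runs standalone, verified against the `startswith` listcomp on the longer test string:

```python
import string

def finditer(text, pattern):
    # Repeated C-level str.find; yields every (overlapping) match index.
    i = 0
    while True:
        i = text.find(pattern, i)
        if i == -1: return
        yield i
        i += 1

text = ('hahahaha' + string.ascii_lowercase + 'ha') * 100
pattern = 'ha'
listcomp = [i for i in range(len(text) - len(pattern) + 1)
            if text[i:].startswith(pattern)]
assert list(finditer(text, pattern)) == listcomp
```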