Question

If I have a pattern 'haha' and a text 'hahahaha', I want the indices of the pattern to be returned in a list like [0,2,4]. However, if I do it this way

def find_pat(text, pattern):
    import re
    return [x.start() for x in re.finditer(pattern, text)]

it returns [0,4], and fails to see the repetition of the pattern at index 2. How would I achieve the needed output in the most pythonic and efficient way?

Solution

pattern = "haha"
text = "hahahaha"
[i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
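
For the question's example this catches the overlapping match at index 2 (a quick interpreter check, not quoted from the original answer):

>>> pattern = "haha"
>>> text = "hahahaha"
>>> [i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
[0, 2, 4]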

re.finditer consumes the characters it has matched, so the first match eats the first four letters and the scan resumes at index 4, skipping the overlapping occurrence at index 2 ...

Actually, the solution using the lookahead is probably the right one (based on the question rather than the use case), but this would also work ...

OTHER TIPS

Use positive lookahead assertion:

>>> import re
>>> def find_pat(text, pattern):
...     return [x.start() for x in re.finditer("(?={})".format(pattern), text)]
...
>>> find_pat('hahahaha', 'haha')
[0, 2, 4]
>>>

Here is a reference.
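
One caveat worth adding (my note, not part of the original answer): because the pattern is pasted straight into a regex, metacharacters like . or + in it would be interpreted as regex syntax. Escaping the pattern with re.escape keeps the function safe for literal strings:

>>> import re
>>> def find_pat(text, pattern):
...     # re.escape neutralizes regex metacharacters in the pattern
...     return [x.start() for x in re.finditer("(?={})".format(re.escape(pattern)), text)]
...
>>> find_pat('a+a+a+', 'a+')
[0, 2, 4]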

As Joran Beasley's answer shows, using plain string functions will be much faster than using regexps for static string lookups.

But testing startswith N times could itself be a huge slowdown if N is very large and matches are uncommon. Also, since you were using finditer rather than findall, I suspect you may be worried about that case.

You can get the best of both worlds by using str.find. Of course ultimately this is doing the same work as using startswith at each point—but it's doing that work in C, zipping over long stretches with no matches as much as 20x faster.

On the other hand, there's no way to wrap this repeated find up in a trivial looping expression. (Unless you build a complicated wrapper using iter around a closure, but I doubt that would actually help anything.) So, the code will look more complicated than Joran's listcomp. And it could be slower when matches are very common (because in that case you're spending most of your time in the loop, and an explicit loop statement is slower than a comprehension).

On the plus side, the extra verbosity means it's more obvious how to customize it. For example, if you decide you want to skip overlapped matches, just do i += len(pattern) instead of i += 1 (see the sketch after the code below).

def finditer(text, pattern):
    i = 0
    while True:
        # str.find does the scanning in C and returns -1 when
        # there is no occurrence at or after index i
        i = text.find(pattern, i)
        if i == -1:
            return
        yield i
        # advance by one character so overlapping matches are found
        i += 1
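
And here is the non-overlapping variant mentioned above, as a sketch (finditer_nonoverlap is my name for it, not the original answer's):

def finditer_nonoverlap(text, pattern):
    i = 0
    while True:
        i = text.find(pattern, i)
        if i == -1:
            return
        yield i
        # jump past the whole match instead of one character
        i += len(pattern)

For 'hahahaha' and 'haha' this yields [0, 4] rather than [0, 2, 4].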

From a quick test (under 64-bit Apple CPython 2.7.5):

In [931]: pattern = 'ha'
In [932]: text = 'hahahaha'
In [933]: %timeit [i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
100000 loops, best of 3: 2.69 µs per loop
In [934]: %timeit list(finditer(text, pattern))
100000 loops, best of 3: 3.56 µs per loop
In [935]: text = ('hahahaha' + string.ascii_lowercase + 'ha')*100
     ...: pattern = 'ha'
In [936]: %timeit [i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
100000 loops, best of 3: 1.74 ms per loop
In [937]: %timeit list(finditer(text, pattern))
100000 loops, best of 3: 180 µs per loop

So, it's almost as fast as Joran's code even for a very short string with 50% matches; for a much longer string with 11% matches, it's already 9.6x faster. If matches are even less common, or if we don't actually need a list, obviously it will win even bigger.
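
For example, since finditer is a generator, you can stop after the first few hits without scanning the rest of the text (a usage sketch, assuming the finditer defined above):

from itertools import islice

text = 'hahahaha' * 1000
# only the prefix needed for three matches is ever scanned
first_three = list(islice(finditer(text, 'haha'), 3))  # [0, 2, 4]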

I'm sure there's a better way to do this. Here's my first take on it, though:

EDIT: I misread the question -- he was trying to match [haha]haha, ha[haha]ha, and haha[haha]. Mine only matches [haha]haha and haha[haha], because I thought he was looking for unique (non-overlapping) matches. Reading comprehension is a plus when programming.

def find_text(text, pattern):
    indexes = []
    index = 0
    patlen = len(pattern)
    while index <= len(text) - patlen:
        if text[index:].startswith(pattern):
            indexes.append(index)
            # skip past the whole match, so only
            # non-overlapping occurrences are recorded
            index += patlen
        else:
            index += 1
    return indexes
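
To make the difference concrete (interpreter output added for illustration):

>>> find_text('hahahaha', 'haha')   # non-overlapping
[0, 4]
>>> find_pat('hahahaha', 'haha')    # lookahead version from above
[0, 2, 4]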
Licensed under: CC-BY-SA with attribution