Question

I'm trying to do string matching in Python.

I have a list of strings (len ~3000) and a file, and for each line in the file I want to check whether it contains at least one of the strings in the list.

The most straightforward way is to check them one by one, but that takes time (though not that long).

Is there a way to search faster?

For example:

lst = ["aq", "bs", "ce"]

if the line is "aqwerqwerqwer"  -> true (since has "aq" in it)
if the line is "qweqweqwe" -> false (has none of "aq", "bs" or "ce")

Solution

You can use any and a generator expression:

# Avoid calling the list "list" -- that would shadow the built-in
lst = ["a", "b", "c"]
if any(s in line for s in lst):
    # Do stuff

The above code tests whether any item in lst can be found in line. If so, the # Do stuff block will run.

See a demonstration below:

>>> lst = ["aq", "bs", "ce"]
>>> if any(s in "aqwerqwerqwer" for s in lst):
...     print(True)
...
True
>>> if any(s in "qweqweqwe" for s in lst):
...     print(True)
...
>>>
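
Applied to the file from the question, that check drops into a plain loop (the file name below is just a placeholder):

lst = ["aq", "bs", "ce"]

with open("myfile.txt") as f:          # "myfile.txt" stands in for your file
    for line in f:
        if any(s in line for s in lst):
            print(line.rstrip("\n"))   # this line contains at least one string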

OTHER TIPS

This is actually a good use case for the regular expression engine, with an automatically generated pattern.

Try:

import re

def re_match(strings_to_match, my_file):
    # build one alternation pattern that matches any of the strings
    expression = re.compile(
        '(' +
        '|'.join(re.escape(item) for item in strings_to_match) +
        ')')

    # check every line; True only if each line contains at least one string
    for line in my_file:
        if not expression.search(line):
            return False
    return True

A regular expression will be faster than a simple linear scan over every string to match, for every line. There are two reasons: regular expressions are implemented in C, and the pattern is compiled into a state machine that examines each input character only once, instead of several times as in the naïve solution.

See the comparison in an IPython notebook: http://nbviewer.ipython.org/gist/liori/10170227. The test data consists of 3000 strings to match against a list of 1 million lines. The naïve approach took 1 min 46 s on my machine, whereas this solution took just 9.97 s.
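
If you want to reproduce that kind of comparison on your own data, here is a rough sketch using timeit (the strings and line counts below are made up; with only a handful of short strings the naïve scan may well win, and the regular expression pays off as the list grows):

import re
import timeit

strings = ["aq", "bs", "ce"]                     # stand-ins for the ~3000 real strings
lines = ["aqwerqwerqwer", "qweqweqwe"] * 50000   # synthetic test data

pattern = re.compile('|'.join(re.escape(s) for s in strings))

def naive():
    return sum(1 for line in lines if any(s in line for s in strings))

def with_regex():
    return sum(1 for line in lines if pattern.search(line))

print(timeit.timeit(naive, number=1))
print(timeit.timeit(with_regex, number=1))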

You could use itertools.groupby to split the lines into alternating runs of matching and non-matching lines:

from itertools import groupby
pats = ['pat', 'pat2', …]
matches = groupby(lines, key=lambda line: [pat for pat in pats if pat in line])

If your patterns are all single-character strings, you can optimize this further using a set:

pats = set('abcd')
matches = groupby(lines, key=pats.intersection)

Which will result in an iterable similar to

[(matched patterns, lines matched),
 (empty list, lines not matched),
 (matched patterns, lines matched),
 …]

(Except it will be a generator, not a list.) That's the major logic of it. What follows is one way of iterating over that preprocessed generator to produce output.

for matched_pats, linegrp in matches:
    for line in linegrp:
        if matched_pats:
            print('"{}" matched because of "{}"'.format(line, matched_pats))
        else:
            print('"{}" did not match'.format(line))

More involved but much faster: pre-process your list of strings into a prefix trie.

Then, for each file line, starting at each character position, see how far you can walk into the trie.

If you keep a queue of all active trie walks, you only have to look at each character position once as you scan through the line. You could also store a "minimum terminal depth" counter at each trie node, so you can cut a comparison short once you get near the end of the line.
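
Here is a minimal sketch of the per-position trie walk, using plain dicts as nodes (this is the simple variant, without the queue or the depth counter):

END = object()  # sentinel key marking that a complete string ends at this node

def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def line_matches(line, trie):
    # start a walk at every character position; succeed as soon as
    # any walk reaches a node where a complete string ends
    for start in range(len(line)):
        node = trie
        for ch in line[start:]:
            if ch not in node:
                break
            node = node[ch]
            if END in node:
                return True
    return False

trie = build_trie(["aq", "bs", "ce"])
print(line_matches("aqwerqwerqwer", trie))  # True
print(line_matches("qweqweqwe", trie))      # False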


A simpler half-step would be to reduce your big list of strings to a dict of lists of strings, indexed by the first three characters of each string you are looking for (this assumes every string you are looking for is at least three characters long).

from itertools import tee

def triwise(iterable):
    # based on pairwise, from the itertools documentation
    "s -> (s0,s1,s2), (s1,s2,s3), (s2,s3,s4), ..."
    a, b, c = tee(iterable, 3)
    next(b, None)
    next(c, None)
    next(c, None)
    return zip(a, b, c)

class Searcher:
    def __init__(self):
        self.index = {}

    def add_seek_strings(self, strings):
        # index each string under its first three characters
        # (note: strings shorter than 3 characters will never be matched)
        for s in strings:
            pre = s[:3]
            if pre in self.index:
                self.index[pre].append(s)
            else:
                self.index[pre] = [s]

    def find_matches(self, target):
        for offset, (a, b, c) in enumerate(triwise(target)):
            pre = a + b + c
            if pre in self.index:
                from_here = target[offset:]
                for seek in self.index[pre]:
                    if from_here.startswith(seek):
                        yield seek

    def is_match(self, target):
        for match in self.find_matches(target):
            return True
        return False

def main():
    srch = Searcher()
    srch.add_seek_strings(["the", "words", "you", "want"])

    with open("myfile.txt") as inf:
        matched_lines = [line for line in inf if srch.is_match(line)]

if __name__=="__main__":
    main()
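
For interactive use, find_matches also tells you which strings hit (the seek strings below are the same toy examples):

>>> srch = Searcher()
>>> srch.add_seek_strings(["the", "words", "you", "want"])
>>> list(srch.find_matches("you said the words"))
['you', 'the', 'words']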
Licensed under: CC-BY-SA with attribution