Question

I'm trying to do string matching in Python.

I have a list of strings (len ~3000) and a file, and for each line in the file I want to check whether it contains at least one of the strings in the list.

The most straightforward way is to check them one by one, but that takes time (though not that long).

Is there a way to search faster?

For example:

lst = ["aq", "bs", "ce"]

if the line is "aqwerqwerqwer"  -> true (since has "aq" in it)
if the line is "qweqweqwe" -> false (has none of "aq", "bs" or "ce")

Solution

You can use any and a generator expression:

# Avoid calling the list "list" -- that would shadow the built-in
lst = ["a", "b", "c"]
if any(s in line for s in lst):
    # Do stuff

The above code tests whether any item in lst can be found in line. If so, the # Do stuff block will run.

See a demonstration below:

>>> lst = ["aq", "bs", "ce"]
>>> if any(s in "aqwerqwerqwer" for s in lst):
...     print(True)
...
True
>>> if any(s in "qweqweqwe" for s in lst):
...     print(True)
...
>>>
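
Applied to the file from the question, that check drops into a plain loop (the file name below is just a placeholder):

lst = ["aq", "bs", "ce"]

with open("myfile.txt") as f:          # "myfile.txt" stands in for your file
    for line in f:
        if any(s in line for s in lst):
            print(line.rstrip("\n"))   # this line contains at least one string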

OTHER TIPS

This is actually a good use case for the regular expression engine, with an automatically generated pattern.

Try:

import re

def re_match(strings_to_match, my_file):
    # build one alternation pattern that matches any of the strings
    expression = re.compile(
        '(' +
        '|'.join(re.escape(item) for item in strings_to_match) +
        ')')

    # check every line; True only if each line contains at least one string
    for line in my_file:
        if not expression.search(line):
            return False
    return True

A regular expression will be faster than a simple linear scan over every string to match, for every line. There are two reasons: regular expressions are implemented in C, and the pattern is compiled into a state machine that examines each input character only once, instead of several times as in the naïve solution.

See the comparison in an IPython notebook: http://nbviewer.ipython.org/gist/liori/10170227. The test data consists of 3000 strings to match against a list of 1 million lines. The naïve approach took 1 min 46 s on my machine, whereas this solution took just 9.97 s.
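
If you want to reproduce that kind of comparison on your own data, here is a rough sketch using timeit (the strings and line counts below are made up; with only a handful of short strings the naïve scan may well win, and the regular expression pays off as the list grows):

import re
import timeit

strings = ["aq", "bs", "ce"]                     # stand-ins for the ~3000 real strings
lines = ["aqwerqwerqwer", "qweqweqwe"] * 50000   # synthetic test data

pattern = re.compile('|'.join(re.escape(s) for s in strings))

def naive():
    return sum(1 for line in lines if any(s in line for s in strings))

def with_regex():
    return sum(1 for line in lines if pattern.search(line))

print(timeit.timeit(naive, number=1))
print(timeit.timeit(with_regex, number=1))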

You could use itertools.groupby to split the lines into alternating runs of matching and non-matching lines:

from itertools import groupby
pats = ['pat', 'pat2', …]
matches = groupby(lines, key=lambda line: [pat for pat in pats if pat in line])

If your patterns are all single-character strings, you can optimize this further using a set:

pats = set('abcd')
matches = groupby(lines, key=pats.intersection)

Which will result in an iterable similar to

[(matched patterns, lines matched),
 (empty list, lines not matched),
 (matched patterns, lines matched),
 …]

(Except it will be a generator, not a list.) That's the major logic of it. What follows is one way of iterating over that preprocessed generator to produce output.

for matched_pats, linegrp in matches:
    for line in linegrp:
        if matched_pats:
            print('"{}" matched because of "{}"'.format(line, matched_pats))
        else:
            print('"{}" did not match'.format(line))

More involved but much faster: pre-process your list of strings into a prefix trie.

Then, for each file line, starting at each character position, see how far you can walk into the trie.

If you keep a queue of all active trie walks, you only have to look at each character position once as you scan through the line. You could also store a "minimum terminal depth" counter at each trie node, so you can cut a comparison short once you get near the end of the line.
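
Here is a minimal sketch of the per-position trie walk, using plain dicts as nodes (this is the simple variant, without the queue or the depth counter):

END = object()  # sentinel key marking that a complete string ends at this node

def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def line_matches(line, trie):
    # start a walk at every character position; succeed as soon as
    # any walk reaches a node where a complete string ends
    for start in range(len(line)):
        node = trie
        for ch in line[start:]:
            if ch not in node:
                break
            node = node[ch]
            if END in node:
                return True
    return False

trie = build_trie(["aq", "bs", "ce"])
print(line_matches("aqwerqwerqwer", trie))  # True
print(line_matches("qweqweqwe", trie))      # False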


A simpler half-step would be to reduce your big list of strings to a dict of lists of strings, indexed by the first three characters of each string you are looking for (this assumes every string you are looking for is at least three characters long).

from itertools import tee

def triwise(iterable):
    # based on pairwise, from the itertools documentation
    "s -> (s0,s1,s2), (s1,s2,s3), (s2,s3,s4), ..."
    a, b, c = tee(iterable, 3)
    next(b, None)
    next(c, None)
    next(c, None)
    return zip(a, b, c)

class Searcher:
    def __init__(self):
        self.index = {}

    def add_seek_strings(self, strings):
        # index each string under its first three characters
        # (note: strings shorter than 3 characters will never be matched)
        for s in strings:
            pre = s[:3]
            if pre in self.index:
                self.index[pre].append(s)
            else:
                self.index[pre] = [s]

    def find_matches(self, target):
        for offset, (a, b, c) in enumerate(triwise(target)):
            pre = a + b + c
            if pre in self.index:
                from_here = target[offset:]
                for seek in self.index[pre]:
                    if from_here.startswith(seek):
                        yield seek

    def is_match(self, target):
        for match in self.find_matches(target):
            return True
        return False

def main():
    srch = Searcher()
    srch.add_seek_strings(["the", "words", "you", "want"])

    with open("myfile.txt") as inf:
        matched_lines = [line for line in inf if srch.is_match(line)]

if __name__=="__main__":
    main()
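
For interactive use, find_matches also tells you which strings hit (the seek strings below are the same toy examples):

>>> srch = Searcher()
>>> srch.add_seek_strings(["the", "words", "you", "want"])
>>> list(srch.find_matches("you said the words"))
['you', 'the', 'words']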
Licensed under: CC-BY-SA with attribution