Search for word (from list of words) in line (from list of lines) and append values to new list. Python

Question 1

result = []
names = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
lines = iter(data)  # one shared iterator, so the inner loop resumes where the outer one stopped
for line in lines:
    # `while`, not `if`: the line that ends one sequence may itself be the next matching header
    while line.startswith(">") and any(name in line for name in names):
        name = line
        upper_seq = []
        for line in lines:
            if not line.isupper():
                break
            upper_seq.append(line)
        else:
            line = ""  # guard against infinite loop at EOF

        result.append((name, ''.join(upper_seq)))

If there are many names, a set() lookup might be faster than any(...). Note that it matches whole whitespace-separated words in the line rather than substrings:

names = set(names)
# ...
    if line.startswith(">") and names.intersection(line[1:].split()):
        # ...
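
The two conditions are not interchangeable, so it may help to see where they differ. The following is a minimal sketch; the helper names substring_match and word_match and the sample headers are made up for illustration and are not part of the original answer:

```python
names = ["link", "ganon"]
wanted = set(names)

def substring_match(line):
    # True if any name occurs anywhere in the header, even inside a longer word.
    return line.startswith(">") and any(name in line for name in names)

def word_match(line):
    # True only if a whole whitespace-separated token equals one of the names.
    return line.startswith(">") and bool(wanted.intersection(line[1:].split()))

print(substring_match(">linkage map"))   # True: "link" is a substring of "linkage"
print(word_match(">linkage map"))        # False: no token equals "link"
print(word_match(">link is the first"))  # True
print(word_match(">ganon, the fourth"))  # False: the token is "ganon," with a comma
```

Which behaviour is wanted depends on how the header lines are written; if names can be followed by punctuation, the set version would need extra tokenizing.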

Result

[('>link is the first', 'OIGFHFHAGIUUIIUFG'),
 ('>ganon is the fourth', 'ADGGHHHHHH'),
 ('>volvagia is the last', 'AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR')]
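
For readers who want something to run directly, here is a self-contained sketch that produces a result of the same shape. It is a single-pass variant that tracks the current header in a variable, not the nested-iterator approach above, and the input list is invented for this example (the original data is not shown here); collect_sequences is a hypothetical helper name:

```python
def collect_sequences(data, names):
    """Pair each matching '>' header with the uppercase lines that follow it."""
    result = []
    header, parts = None, []
    for line in data:
        if line.startswith(">"):
            if header is not None:
                result.append((header, "".join(parts)))
            # Start a new record only if the header mentions a wanted name.
            header = line if any(name in line for name in names) else None
            parts = []
        elif header is not None and line.isupper():
            parts.append(line)
        elif header is not None:
            # A non-uppercase line ends the current record.
            result.append((header, "".join(parts)))
            header, parts = None, []
    if header is not None:
        result.append((header, "".join(parts)))
    return result

# Hypothetical input, invented for this sketch.
data = [
    ">link is the first", "OIGFHFH", "AGIUUIIUFG",
    ">epona is the second", "AAAA",
    ">navi is the third", "cccc",
    ">ganon is the fourth", "ADGGHHHHHH",
    ">volvagia is the last", "AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR",
]
names = ["link", "zelda", "saria", "ganon", "volvagia"]

pairs = collect_sequences(data, names)
for pair in pairs:
    print(pair)
```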

Question 2

Use a list comprehension:

print([line for line in lines if line.startswith(">") and set(my_words).intersection(line[1:].split())])

This decomposes into a for loop as follows:

matched_lines = []
for line in lines:
    if line.startswith(">") and set(my_words).intersection(line[1:].split()):
        matched_lines.append(line)

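Using a set intersection should be faster than looping over each word in the list and seeing if it is in the string, at least when the word list is long. Building the set once, outside the loop, avoids reconstructing it for every line.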
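
Whether the set version actually wins depends on the size of the word list and the length of the lines, so it is worth measuring. A rough sketch using timeit, with hypothetical inputs and the set built once up front (the function names with_set and with_any are made up for this example):

```python
import timeit

# Hypothetical inputs, invented for this sketch.
my_words = ["link", "zelda", "saria", "ganon", "volvagia"]
lines = [">link is the first", "OIGFHFHAGIUUIIUFG",
         ">epona is the second", "AAAA",
         ">ganon is the fourth", "ADGGHHHHHH"]

wanted = set(my_words)  # built once, not once per line

def with_set():
    return [line for line in lines
            if line.startswith(">") and wanted.intersection(line[1:].split())]

def with_any():
    return [line for line in lines
            if line.startswith(">") and any(word in line for word in my_words)]

matched_lines = with_set()
print(matched_lines)

# Timings vary by machine and input, so no expected numbers are shown.
print("set:", timeit.timeit(with_set, number=10000))
print("any:", timeit.timeit(with_any, number=10000))
```

With only a handful of words the difference may be small or even reversed, since splitting each line has its own cost.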

>>> print([line for line in data if line.startswith(">") and set(query).intersection(line[1:].split())])
['>link is the first', '>ganon is the fourth', '>volvagia is the last']

Question 3

There are more elegant ways to do this, but I think this method might be the easiest for you to understand:

>>> found_lines = []
>>> sequences = []
>>> for line in data:
...     if line.startswith(">"):
...         for name in query:
...             if name in line:
...                 found_lines.append(line)
...                 break  # stop after the first match so a line is added only once
...     else:
...         sequences.append(line)
...
>>> print(found_lines)
['>link is the first', '>ganon is the fourth', '>volvagia is the last']

Always start simple, and think your way through the problem. What's the first thing you need to do? You want to loop over every line in data (for line in data).

For each of those lines, you want to check whether it starts with > (if line.startswith(">")). If it doesn't start with that character, then we can assume it's a "sequence" and add it to the sequences list (sequences.append(line)).

If it does, then you want to check whether any of the names in query appear in that line. What's the easiest way to do that? Loop over every one of the names (for name in query), and check each one by itself (if name in line).
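
Once the steps are clear, the inner loop over names can be folded into any(), which is one of the more compact ways to write the same check. This is only a sketch: the function name split_headers_and_sequences and the sample data are invented for illustration.

```python
def split_headers_and_sequences(data, query):
    """Return (matching header lines, all non-header lines)."""
    found_lines = []
    sequences = []
    for line in data:
        if line.startswith(">"):
            # any() stops at the first matching name, so a header is added at most once.
            if any(name in line for name in query):
                found_lines.append(line)
        else:
            sequences.append(line)
    return found_lines, sequences

# Hypothetical input, invented for this sketch.
data = [">link is the first", "OIGFHFHAGIUUIIUFG",
        ">epona is the second", "AAAA",
        ">ganon is the fourth", "ADGGHHHHHH"]
query = ["link", "ganon", "volvagia"]

found_lines, sequences = split_headers_and_sequences(data, query)
print(found_lines)
print(sequences)
```

Note that sequences collects every non-header line, including those that follow a header that did not match, so the two lists do not line up one to one.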