Search for word (from list of words) in line (from list of lines) and append values to new list. Python

StackOverflow https://stackoverflow.com/questions/15349121

Question

If you had a list of names . . .

query = ['link','zelda','saria','ganon','volvagia']

and a list of lines from a file

data = ['>link is the first','OIGFHFH','AGIUUIIUFG','>peach is the second',
'AGFDA','AFGDSGGGH','>luigi is the third','SAGSGFFG','AFGDFGDFG',
'DSGSFGAAA','>ganon is the fourth','ADGGHHHHHH','>volvagia is the last',
 'AFGDAAFGDA','ADFGAFD','ADFDFFDDFG','AHUUERR','>ness is another','ADFGGGGH',
'HHHDFDA']

how would you look at all lines that start with '>' and, if a line contains one of the names in query, keep that line (with the '>') along with the sequences that follow it (the sequences are always uppercase), in two separate lists?

#example output file

name_list = ['>link is the first','>ganon is the fourth','>volvagia is the last']
seq_list = ['OIGFHFHAGIUUIIUFG','ADGGHHHHHH','AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR']

I would rather not use a dictionary for this, as I've been prompted to do in similar situations.

What I have so far is:

for line,name in zip(data,query):
    if bool(line[0] == '>' and re.search(name,line))==True:
        #but then i'm stuck because len(query) and len(data) are not equal

Any help would be greatly appreciated.

Solution

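result = []
names = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
lines = iter(data)  # one shared iterator, so the inner loop consumes lines too
for line in lines:
    # `while` rather than `if`: the line that ends a sequence may itself be
    # a matching header, so it is tested again before moving on
    while line.startswith(">") and any(name in line for name in names):
        name = line
        upper_seq = []
        for line in lines:
            if not line.isupper():
                break  # first non-uppercase line ends the sequence
            upper_seq.append(line)
        else:
            line = ""  # guard against infinite loop at EOF

        result.append((name, ''.join(upper_seq)))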

If there are many names, a set() might be faster than any(...) for finding names in a line. Note that it matches whole whitespace-separated words, whereas name in line also matches substrings:

names = set(names)
# ...
    if line.startswith(">") and names.intersection(line[1:].split()):
        # ...
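The snippet above leaves out the rest of the loop. As a rough sketch of how the set-based test might fit into a complete pass over the data: collect_matches is a hypothetical helper name, not part of the answer, and it assumes that every line not starting with ">" belongs to the preceding header (instead of checking isupper()).

```python
def collect_matches(data, names):
    """Return (header, joined_sequence) pairs for headers that name a wanted word.

    `collect_matches` is a hypothetical helper, introduced only for illustration.
    Assumption: any line that does not start with ">" is a sequence line.
    """
    wanted = set(names)
    result = []
    keep = False  # True while we are inside a record whose header matched
    for line in data:
        if line.startswith(">"):
            # Whole-word match: drop the ">" and split the header on whitespace.
            keep = bool(wanted.intersection(line[1:].split()))
            if keep:
                result.append((line, []))
        elif keep:
            result[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in result]


query = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
data = ['>link is the first', 'OIGFHFH', 'AGIUUIIUFG', '>peach is the second',
        'AGFDA', 'AFGDSGGGH', '>luigi is the third', 'SAGSGFFG', 'AFGDFGDFG',
        'DSGSFGAAA', '>ganon is the fourth', 'ADGGHHHHHH', '>volvagia is the last',
        'AFGDAAFGDA', 'ADFGAFD', 'ADFDFFDDFG', 'AHUUERR', '>ness is another',
        'ADFGGGGH', 'HHHDFDA']

pairs = collect_matches(data, query)
# Substring hits such as "linked" do not count as a match for "link".
no_match = collect_matches(['>linked list', 'ABC'], ['link'])
```

Because the header is split into words, a header like '>linked list' would not match 'link' here, which may or may not be what you want.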

Result

[('>link is the first', 'OIGFHFHAGIUUIIUFG'),
 ('>ganon is the fourth', 'ADGGHHHHHH'),
 ('>volvagia is the last', 'AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR')]
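
The question asked for two separate lists and not a list of pairs. A minimal sketch of one way to split the pairs, assuming result holds the output shown above (it is copied in here so the snippet runs on its own):

```python
# `result` is copied from the output above.
result = [('>link is the first', 'OIGFHFHAGIUUIIUFG'),
          ('>ganon is the fourth', 'ADGGHHHHHH'),
          ('>volvagia is the last', 'AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR')]

# zip(*result) transposes the pairs into a column of names and a column of
# sequences. The `or` fallback covers an empty result, where there is
# nothing to unpack.
name_list, seq_list = [list(column) for column in zip(*result)] or ([], [])

print(name_list)
print(seq_list)
```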

OTHER TIPS

Use a list comprehension:

print([line for line in lines if line.startswith(">") and set(my_words).intersection(line[1:].split())])

This decomposes to a for loop as follows:

matched_lines = []
for line in lines:
    if line.startswith(">") and set(my_words).intersection(line[1:].split()):
        matched_lines.append(line)

Using a set intersection can be faster than looping over each word in the list and checking whether it is in the string, particularly when the word list is long.

>>> print([line for line in data if line.startswith(">") and set(query).intersection(line[1:].split())])
['>link is the first', '>ganon is the fourth', '>volvagia is the last']
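
The comprehension above rebuilds the set for every line. A small sketch of the same filter with the set built once up front; the name wanted is an assumption introduced here, and data is a shortened sample of the list from the question:

```python
query = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
# Shortened sample of the question's data, to keep the example small.
data = ['>link is the first', 'OIGFHFH', '>peach is the second', 'AGFDA',
        '>ganon is the fourth', 'ADGGHHHHHH']

wanted = set(query)  # built once, outside the comprehension

# isdisjoint() returns True when the two collections share no element,
# so "not isdisjoint" means at least one header word is a wanted name.
matched = [line for line in data
           if line.startswith(">") and not wanted.isdisjoint(line[1:].split())]
print(matched)
```

Like the original, this only selects the header lines; it does not collect the sequences that follow them.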

There are more elegant ways to do this, but I think this method might be the easiest for you to understand:

>>> found_lines = []
>>> sequences = []
>>> for line in data:
...     if line.startswith(">"):
...         for name in query:
...             if name in line:
...                 found_lines.append(line)
...     else:
...         sequences.append(line)
...
>>> print(found_lines)
['>link is the first', '>ganon is the fourth', '>volvagia is the last']

Always start simple, and think your way through the problem. What's the first thing you need to do? You want to loop over every line in data (for line in data).

For each of those lines, you want to check whether it starts with > (if line.startswith(">")). If it doesn't start with that character, then we can assume it's a "sequence" and add it to the sequences list (sequences.append(line)).

If it does, then you want to check whether any of the names in query appear in that line. What's the easiest way to do that? Loop over every one of the names (for name in query), and check each one by itself (if name in line).
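
The loop above collects every sequence line, including those under headers that did not match, and it would append a header twice if two names appeared in it. One possible extension of the same step-by-step idea, sketched under the assumption that each non-header line belongs to the most recent header; the collecting flag is a name introduced here for illustration:

```python
query = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
data = ['>link is the first', 'OIGFHFH', 'AGIUUIIUFG', '>peach is the second',
        'AGFDA', 'AFGDSGGGH', '>luigi is the third', 'SAGSGFFG', 'AFGDFGDFG',
        'DSGSFGAAA', '>ganon is the fourth', 'ADGGHHHHHH', '>volvagia is the last',
        'AFGDAAFGDA', 'ADFGAFD', 'ADFDFFDDFG', 'AHUUERR', '>ness is another',
        'ADFGGGGH', 'HHHDFDA']

name_list = []
seq_list = []
collecting = False  # True while the most recent header matched a name

for line in data:
    if line.startswith(">"):
        # any() stops at the first hit, so a header is recorded only once.
        collecting = any(name in line for name in query)
        if collecting:
            name_list.append(line)
            seq_list.append("")  # start an empty sequence for this header
    elif collecting:
        seq_list[-1] += line  # extend the sequence of the current header

print(name_list)
print(seq_list)
```

The two lists stay aligned by position, so seq_list[i] is the joined sequence for name_list[i], and no dictionary is involved.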

Licensed under: CC-BY-SA with attribution