python - extract several lines following a matched string

https://stackoverflow.com/questions/21087121

27-09-2022

Question

I have two data files, each made up of sets of 4 lines. I need to extract every 4-line set from the second file whose first line partially matches the first line of a set in the first file.

Here is an example of input data:

input1.txt
@abcde:134/1
JDOIJDEJAKJ
content1
content2

input2.txt
@abcde:134/2
JKDJFLJSIEF
content3
content4
@abcde:135/2
KFJKDJFKLDJ
content5
content6

Here is what the output should look like:

output.txt
@abcde:134/2
JKDJFLJSIEF
content3
content4

Here is my attempt at writing code...

import sys

filename1 = sys.argv[1] #input1.txt
filename2 = sys.argv[2] #input2.txt

F = open(filename1, 'r')
R = open(filename2, 'r')

def output(input1, input2):
    for line in input1:
        if "@" in line:
            for line2 in input2:
                if line[:-1] in line2:
                    for i in range(4):
                        print next(input2)

output = output(F, R)
write(output)

I get invalid syntax for next() which I can't figure out, and I would be happy if someone could correct my code or give me tips on how to make this work.

===EDIT=== OK, I think I have managed to implement the solutions proposed in the comments below (thank you). I am now running the code on a Terminal session connected by ssh to a remote Ubuntu server. Here is what the code looks like now. (This time I am running python2.7)

filename1 = sys.argv[1] #input file 1
filename2 = sys.argv[2] #input file 2 (some lines of which will be in the output)

F = open(filename1, 'r')
R = open(filename2, 'r')

def output(input1, input2):
    for line in input1:
        input2.seek(0)
        if "@" in line:
            for line2 in input2:
                if line[:-2] in line2:
                    for i in range(4):
                        out = next(input2)
                        print out
                        return

output (F, R)

Then I run this command:

python fetch_reverse.py test1.fq test.fq > test2.fq

I don't get any warnings, but the output file is empty. What am I doing wrong?

Solution

Separate reading the first file from reading the second. First gather all the lines you want to match (this works unless you have hundreds of thousands of lines to match), and store each one, minus the digit at the end, in a set for fast lookup.
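As a rough illustration of that first step, the sketch below builds the set of keys from the sample input1.txt lines in the question. The helper name `read_key` is hypothetical, and it assumes the only part that differs between the two files is the single trailing character of the header (the `1` or `2` after the slash).

```python
def read_key(header):
    """Return a header line without its trailing newline and final character."""
    return header.strip()[:-1]

# Sample lines from input1.txt in the question
input1_lines = ["@abcde:134/1\n", "JDOIJDEJAKJ\n", "content1\n", "content2\n"]

# Only header lines (those starting with '@') contribute a key
to_match = {read_key(line) for line in input1_lines if line.startswith("@")}

print(to_match)  # {'@abcde:134/'}
```

Both `@abcde:134/1` and `@abcde:134/2` reduce to the same key, which is what lets a header from the second file be looked up directly in the set.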

Then scan the other file for matching lines:

def output(input1, input2):
    with input1:  # automatically close when done
        # set comprehension of all lines starting with @, minus last character
        to_match = {line.strip()[:-1] for line in input1 if line[0] == '@'}

    with input2:
        for line in input2:
            if line[0] == '@' and line.strip()[:-1] in to_match:
                print line.strip()
                for i in range(3):
                    print next(input2, '').strip()

You need to print the matched line itself, then read and print only the next three lines, because the first line of the set was already consumed by the for loop.
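On Python 3, where print is a function, the same approach might look like the sketch below. The function name `matching_records` and the `record_len` parameter are illustrative and not part of the original answer. It assumes every set is exactly four lines long and that only the first line of a set starts with `@`.

```python
from itertools import islice

def matching_records(lines1, lines2, record_len=4):
    """Yield sets of lines from lines2 whose header, minus its last
    character, also appears among the headers in lines1."""
    to_match = {l.strip()[:-1] for l in lines1 if l.startswith("@")}
    it = iter(lines2)
    for line in it:
        if line.startswith("@") and line.strip()[:-1] in to_match:
            # The header is already consumed; pull the rest of the set
            # from the same iterator so the loop resumes after it.
            rest = islice(it, record_len - 1)
            yield [line.strip()] + [r.strip() for r in rest]

# Sample data from the question, standing in for the two open files
input1 = ["@abcde:134/1\n", "JDOIJDEJAKJ\n", "content1\n", "content2\n"]
input2 = ["@abcde:134/2\n", "JKDJFLJSIEF\n", "content3\n", "content4\n",
          "@abcde:135/2\n", "KFJKDJFKLDJ\n", "content5\n", "content6\n"]

for record in matching_records(input1, input2):
    print("\n".join(record))
```

With real files, the two lists can be replaced by file objects opened in a `with` statement, since the function only needs something it can iterate over line by line.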

Licensed under: CC-BY-SA with attribution
