Python Extracting strings from multiple lines of strings

https://stackoverflow.com/questions/20207674

05-08-2022

Problem

I would like to extract strings from an input file like the one below:

>a11
UCUUUGGUUAUCUAGCUGUAUGA
>a11
UCUUUGGUUAUCUAGCUGUAUGA
>b22
UGGUCGACCAGUUGGAAAGUAAU
>b22
ACUUCACCUGGUCCACUAGCCGU
>b22
AGGUUGUCUGUGAUGAGUUCG
>t33
UUAAUGCUAAUCGUGAUAGGGGU
>t33
CAGUAACAAAGAUUCAUCCUUGU

A line that starts with ">" is a header, and the line below it is a sequence.

I would like to extract only the sequences whose header starts with ">b22".

This is my code, which does not give the proper answer.

def extractData():
    filename = ("data.txt")
    infile = open(filename,'r')

    for x in infile.readlines():
        x = x.strip()
        if x.startswith(">"):
            header = x
        else:
            sequence = x
        if header.startswith(">b22"):
            print(header, sequence)
    infile.close()

extractData()

It gives result like this:

>b22 UCUUUGGUUAUCUAGCUGUAUGA
>b22 UGGUCGACCAGUUGGAAAGUAAU
>b22 UGGUCGACCAGUUGGAAAGUAAU
>b22 ACUUCACCUGGUCCACUAGCCGU
>b22 ACUUCACCUGGUCCACUAGCCGU
>b22 AGGUUGUCUGUGAUGAGUUCG

But, my expected result is like this:

>b22 UGGUCGACCAGUUGGAAAGUAAU
>b22 ACUUCACCUGGUCCACUAGCCGU
>b22 AGGUUGUCUGUGAUGAGUUCG

Can somebody fix this, please? What is wrong, and what should I change to get the correct result?

Solution

Minor changes in your code:

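def extractData():
    filename = "data.txt"
    infile = open(filename, 'r')
    header = ''  # last header seen; empty until the first ">" line is read

    for x in infile.readlines():
        x = x.strip()
        if x.startswith(">"):
            header = x
        else:
            sequence = x
            # Check the header only after its sequence line has been read
            if header.startswith(">b22"):
                print(header, sequence)
                header = ''  # reset so the same header is not reused

    infile.close()

extractData()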

By the way, you can use a debugger to identify what is wrong with the flow of the program. If you are new to Python, I would recommend using Eclipse with the PyDev plugin for interactive debugging.

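Having said that, the issue appears because if header.startswith(">b22") is evaluated for every line read from the file, including header lines. When a ">b22" header is read, sequence still holds the value from the previous record, so that stale sequence is printed a second time under the new header. When you move the check inside the else block, it is evaluated only after a sequence line has been read, and never for header lines.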
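The same idea can be sketched without touching the file system. This is only an illustration: the function name extract_sequences, the prefix parameter, and the shortened SAMPLE data are assumptions made for the example, not part of the original code. Unlike the code above, it returns the matching pairs instead of printing them and does not reset the header, so every sequence line under a matching header is kept.

```python
import io

# Shortened sample in the same layout as data.txt (assumed for illustration)
SAMPLE = """\
>a11
UCUUUGGUUAUCUAGCUGUAUGA
>b22
UGGUCGACCAGUUGGAAAGUAAU
>b22
ACUUCACCUGGUCCACUAGCCGU
>t33
UUAAUGCUAAUCGUGAUAGGGGU
"""


def extract_sequences(lines, prefix=">b22"):
    """Return (header, sequence) pairs whose header starts with prefix."""
    header = ""   # last header seen so far
    matches = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line  # remember the header, but do not test it yet
        elif header.startswith(prefix):
            # Only sequence lines reach this branch, so no stale sequence
            # can ever be paired with a newly read header.
            matches.append((header, line))
    return matches


for header, sequence in extract_sequences(io.StringIO(SAMPLE)):
    print(header, sequence)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of io.StringIO(SAMPLE).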

Other tips

Here is a different approach:

>>> with open('data.txt') as f:
...     for line in f:
...         if line.startswith('>b22'):
...             print('{0} {1}'.format(line.strip(), next(f).strip()))
...
>b22 UGGUCGACCAGUUGGAAAGUAAU
>b22 ACUUCACCUGGUCCACUAGCCGU
>b22 AGGUUGUCUGUGAUGAGUUCG

Since the file can be iterated over, when you reach the line with >b22, you can use next() to read the next line.
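One caveat with this approach: if a matching header happens to be the last line of the file, next() raises StopIteration. A minimal sketch of a more defensive variant follows, using the two-argument form next(iterator, default); the function name extract_with_next and the empty-string default are assumptions made for the example.

```python
def extract_with_next(lines, prefix=">b22"):
    """Pair each header that starts with prefix with the line after it."""
    it = iter(lines)  # a file object is already its own iterator
    results = []
    for line in it:
        if line.startswith(prefix):
            # Consume the following line; fall back to "" if the input
            # ends right after the header instead of raising StopIteration.
            sequence = next(it, "")
            results.append((line.strip(), sequence.strip()))
    return results


# The last header has no sequence line after it.
sample = [">a11\n", "UCUU\n", ">b22\n", "UGGU\n", ">b22\n"]
for header, sequence in extract_with_next(sample):
    print(header, sequence)
```

With a real file, the call would be extract_with_next(f) inside a with open('data.txt') as f: block.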

License: CC-BY-SA with attribution

Not affiliated with StackOverflow