Question

How does one remove a header from a long string of text?

I have a program that displays a FASTA file as

...TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG...

The string is large and contains multiple headers like this one.

The headers that need to be trimmed start with a > and end with a $. There are multiple headers, ranging from IonTorrenttrimmedcontig1 to IonTorrenttrimmedcontig25.

How can I cut on the > and the $, remove everything in between, and separate the sequence before and after into separate list elements?

The string is read from a standard FASTA file, so I'd be very happy to hear possible solutions for the input step as well.


Solution

Since the string comes from a FASTA file, you can split it on the headers with a regular expression like this:

>>> import re
>>> a = "TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG"
>>> re.split(r">[^$]*\$", a)
['TCGATCATCGATCG', 'CCGTAGGTGAACCTGCGGAAG']
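If the string can start or end with a header, re.split returns an empty string at that position. A minimal sketch that wraps the same pattern and drops those empty pieces, assuming they are unwanted (the helper name split_on_headers is hypothetical, not part of the original answer):

```python
import re

# One header: '>' followed by any characters other than '$', then a literal '$'.
HEADER = re.compile(r">[^$]*\$")

def split_on_headers(text):
    """Return the sequence chunks of text with every header removed."""
    # re.split yields '' where a header sits at the very start or end; drop those.
    return [chunk for chunk in HEADER.split(text) if chunk]

print(split_on_headers(">IonTorrenttrimmedcontig1$CCGTAG>IonTorrenttrimmedcontig2$TTGACA"))
# ['CCGTAG', 'TTGACA']
```

Compiling the pattern once is only a convenience here; calling re.split directly, as above, gives the same pieces.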

Also, some people are answering by slicing on a literal header such as '>ion1'. That will not work here, because every header is different.

I believe this solves your problem. I have also added the bioinformatics tag to this question.

Other tips

I would use the re module for that:

>>> s = "blablabla>ion1$foobar>ion2$etc>ion3$..."
>>> import re
>>> re.split(r">[^$]*\$", s)
['blablabla', 'foobar', 'etc', '...']

And if you have one such string on each line of a file:

>>> with open("foo.txt", "r") as f:
...   for line in f:
...     re.split(r">[^$]*\$", line.rstrip("\n"))
... 
['blablabla', 'foobar', 'etc', '...']
['fofofofofo', 'barbarbar', 'blablabla']
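Since the question also asks about the input step: in a standard FASTA file each record starts with a header line beginning with ">", followed by one or more sequence lines. A rough sketch that drops the headers while reading, assuming that layout (the function name read_fasta_sequences and the sample data are made up for illustration):

```python
import io

def read_fasta_sequences(lines):
    """Collect one sequence string per '>' header line, discarding the headers."""
    sequences = []
    current = None  # parts of the record being read; None until the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                sequences.append("".join(current))
            current = []
        elif current is not None:
            # Sequence line; lines before the first header are ignored.
            current.append(line)
    if current is not None:
        sequences.append("".join(current))
    return sequences

# Made-up two-record example standing in for an open file object.
sample = io.StringIO(">contig1\nTCGATC\nATCG\n>contig2\nCCGTAG\n")
print(read_fasta_sequences(sample))  # ['TCGATCATCG', 'CCGTAG']
```

With a real file, the object returned by open() can be passed in place of the StringIO, since the function only iterates over lines.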

If you are reading the file line by line, there are a few ways to do this. You could use partition, which returns a tuple of three elements (the text before the separator, the separator itself, and the text after it):

for line in file:
    stripped_header = line.partition(">")[2].partition("$")[0]

You could use split:

for line in file:
    stripped_header = line.split(">")[1].split("$")[0]

You could loop over all the characters in the string and only append those that come after ">" but before "$" (however, this will not be nearly as efficient):

for line in file:
    in_header = False  # avoid naming this `bool`, which shadows the built-in
    stripped_header = ""
    for char in line:
        if char == ">":
            in_header = True
        elif in_header:
            if char != "$":
                stripped_header += char
            else:
                in_header = False
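The snippets above capture the text between ">" and "$". To keep the sequence parts instead, as the question asks, the same idea can be turned around with str.find. A sketch without regular expressions (the function name remove_headers and its parameters are hypothetical):

```python
def remove_headers(line, start=">", end="$"):
    """Split line into the pieces that lie outside start...end headers."""
    pieces = []
    pos = 0  # index where the current sequence piece begins
    while True:
        open_at = line.find(start, pos)
        if open_at == -1:
            break  # no more headers
        close_at = line.find(end, open_at + 1)
        if close_at == -1:
            break  # unterminated header: keep the rest of the line unchanged
        if open_at > pos:
            pieces.append(line[pos:open_at])
        pos = close_at + 1  # continue just past the closing '$'
    if pos < len(line):
        pieces.append(line[pos:])
    return pieces

print(remove_headers("blablabla>ion1$foobar>ion2$etc>ion3$..."))
# ['blablabla', 'foobar', 'etc', '...']
```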

Alternatively, use a regular expression, but it seems my peers have already beaten me to it!

License: CC-BY-SA with attribution
Not affiliated with StackOverflow