Question

How does one remove a header from a long string of text?

I have a program that displays a FASTA file as

...TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG...

The string is large and contains multiple headers like this.

The headers that need to be trimmed start with a > and end with a $. There are multiple headers, ranging from IonTorrenttrimmedcontig1 to IonTorrenttrimmedcontig25.

How can I cut on the > and the $, remove everything in between, and separate the code before and after into separate list elements?

The file is read from a standard FASTA file, so I'd be very happy to hear possible solutions for the input step as well.

Solution

Since this is FASTA data, you can split the string with a regular expression like this:

>>> import re
>>> a = "TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG"
>>> re.split(r">[^$]*\$", a)
['TCGATCATCGATCG', 'CCGTAGGTGAACCTGCGGAAG']
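One detail worth noting about the split above: if the string starts (or ends) with a header, `re.split` returns an empty string for the missing piece. A minimal sketch that drops those empty pieces, using the same pattern; the helper name `split_on_headers` is made up for illustration:

```python
import re

# Same pattern as above: ">" followed by anything up to and including "$".
HEADER = re.compile(r">[^$]*\$")

def split_on_headers(text):
    """Split text on every '>...$' header and drop empty pieces."""
    return [piece for piece in HEADER.split(text) if piece]

sample = ">IonTorrenttrimmedcontig1$TCGATC>IonTorrenttrimmedcontig2$CCGTAG"
print(split_on_headers(sample))  # ['TCGATC', 'CCGTAG']
```

Without the filter, the same input would give `['', 'TCGATC', 'CCGTAG']`.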

Also, some answers split on a literal header such as '>ion1'. That is wrong, because the header names vary and a fixed string only matches one of them.

I believe this solves your problem. I have also added the bioinformatics tag to this question.

Other tips

I would use the re module for that:

>>> s = "blablabla>ion1$foobar>ion2$etc>ion3$..."
>>> import re
>>> re.split(r">[^$]*\$", s)
['blablabla', 'foobar', 'etc', '...']

And if you have one string on each line:

>>> with open("foo.txt", "r") as f:
...   for line in f:
...     re.split(r">[^$]*\$", line.rstrip("\n"))
... 
['blablabla', 'foobar', 'etc', '...']
['fofofofofo', 'barbarbar', 'blablabla']
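For the input step, it may be simpler to skip the headers while reading. A minimal sketch, assuming the file is standard FASTA where each header sits on its own line starting with ">" and a sequence may wrap over several lines; the function name `read_fasta_sequences` and the demo data are made up for illustration:

```python
import io

def read_fasta_sequences(handle):
    """Return one sequence string per record, skipping '>' header lines."""
    sequences = []
    current = []  # lines of the record being read
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if it had any sequence.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences

# In-memory stand-in for a file opened with open("foo.fasta").
demo = io.StringIO(
    ">IonTorrenttrimmedcontig1\nTCGATCATCG\nATCG\n"
    ">IonTorrenttrimmedcontig2\nCCGTAGGTGAACC\n"
)
print(read_fasta_sequences(demo))  # ['TCGATCATCGATCG', 'CCGTAGGTGAACC']
```

With a real file, pass the handle from `with open(path) as f:` instead of the `StringIO` object.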

If you are iterating over every line, there are a few ways to extract the text between ">" and "$". You could use partition (partition returns a tuple containing 3 elements: the text before the specified string, the specified string itself, and the text after it):

for line in file:
    stripped_header = line.partition(">")[2].partition("$")[0]
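To make the partition behavior concrete, here is a small sketch on a single sample line (the variable names are illustrative). It also shows that the same two calls give the text before and after the header, and that partition does not raise when the separator is missing:

```python
line = "TCGATC>IonTorrenttrimmedcontig1$CCGTAG"

# First partition splits at the first ">", second at the first "$" after it.
before, _, rest = line.partition(">")
header, _, after = rest.partition("$")

print(header)           # IonTorrenttrimmedcontig1
print([before, after])  # ['TCGATC', 'CCGTAG']

# When the separator is absent, partition returns the whole string
# followed by two empty strings.
missing = "TCGATC".partition(">")
print(missing)          # ('TCGATC', '', '')
```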

You could use split:

for line in file:
    stripped_header = line.split(">")[1].split("$")[0]

You could loop over all the elements in the string and only append after you pass ">" but before "$" (however this will not be nearly as efficient):

for line in file:
    in_header = False  # True while we are between ">" and "$"
    stripped_header = ""
    for char in line:
        if char == ">":
            in_header = True
        elif in_header:
            if char != "$":
                stripped_header += char
            else:
                in_header = False
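The loop above collects the header text. The same character scan can instead keep the pieces outside the headers, which is what the question asks for. A rough sketch, with the function name `split_outside_headers` made up for illustration:

```python
def split_outside_headers(text):
    """Return the pieces of text that lie outside '>...$' headers."""
    pieces = []
    current = []       # characters of the piece being built
    in_header = False  # True while we are between ">" and "$"
    for char in text:
        if char == ">":
            in_header = True
            # A header starts, so the current piece is complete.
            if current:
                pieces.append("".join(current))
                current = []
        elif in_header:
            if char == "$":
                in_header = False
        else:
            current.append(char)
    if current:
        pieces.append("".join(current))
    return pieces

print(split_outside_headers("blablabla>ion1$foobar>ion2$etc"))
# ['blablabla', 'foobar', 'etc']
```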

Alternatively, use a regular expression, but it seems my peers have already beaten me to it!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow