Question

How does one remove a header from a long string of text?

I have a program that displays a FASTA file as

...TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG...

The string is large and contains multiple headers like this.

The headers that need to be trimmed start with a > and end with a $. There are multiple headers, ranging from IonTorrenttrimmedcontig1 to IonTorrenttrimmedcontig25.

How can I cut at the > and the $, remove everything in between, and put the sequence before and after each header into separate list elements?

The file is read from a standard FASTA file, so I'd be very happy to hear possible solutions for the input step as well.


Solution

Since this is part of a FASTA file, you can split on the headers with a regular expression like this:

>>> import re
>>> a = "TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG"
>>> re.split(r">[^$]*\$", a)
['TCGATCATCGATCG', 'CCGTAGGTGAACCTGCGGAAG']
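
One detail to be aware of: when the string starts with a header, re.split returns an empty string for the gap before it. A minimal sketch that filters those out follows. The name split_sequences is hypothetical, and the pattern assumes that no header contains a $ of its own.

```python
import re

# Matches one header: '>' followed by any non-'$' characters, then '$'.
HEADER = re.compile(r">[^$]*\$")

def split_sequences(text):
    """Return the sequence chunks between headers, dropping empty pieces."""
    return [chunk for chunk in HEADER.split(text) if chunk]

data = ">IonTorrenttrimmedcontig1$TCGATC>IonTorrenttrimmedcontig2$CCGTAG"
print(split_sequences(data))  # ['TCGATC', 'CCGTAG']
```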

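Also, some people are answering by slicing on a literal string such as '>ion1'. That is wrong: it handles only one specific header, while the file contains headers from contig1 to contig25.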

I believe this solves your problem! I am also adding the bioinformatics tag to this question.

OTHER TIPS

I would use the re module for that:

>>> s = "blablabla>ion1$foobar>ion2$etc>ion3$..."
>>> import re
>>> re.split(r">[^$]*\$", s)
['blablabla', 'foobar', 'etc', '...']

And if you have one string per line:

>>> with open("foo.txt", "r") as f:
...   for line in f:
...     re.split(r">[^$]*\$", line.rstrip("\n"))
... 
['blablabla', 'foobar', 'etc', '...']
['fofofofofo', 'barbarbar', 'blablabla']
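
Since the question mentions that the data comes from a standard FASTA file, the headers could also be separated at the input step, before everything is joined into one long string. Here is a rough sketch, assuming each header sits on its own line starting with > and that a sequence may span several lines. The helper read_fasta is hypothetical and not part of any library.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs from an iterable of FASTA lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence line; anything before the first header is ignored.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

sample = [">contig1\n", "TCGA\n", "TCAT\n", ">contig2\n", "CCGT\n"]
print(read_fasta(sample))  # [('contig1', 'TCGATCAT'), ('contig2', 'CCGT')]
```

With a real file, the open file object can be passed directly, since iterating over it yields lines. A list of just the sequences is then [seq for _, seq in records].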

If you are reading the file line by line, there are a few ways to isolate the text between > and $. You could use partition (partition returns a tuple of three elements: the text before the separator, the separator itself, and the text after it):

for line in file:
    stripped_header = line.partition(">")[2].partition("$")[0]
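
For reference, str.partition splits at the first occurrence only, so the chained call yields the name of the first header on the line. The parts before and after it are available from the same two calls. A small illustration with a made-up input string:

```python
line = "TCGATC>ion1$CCGTAG"

# Split at the first '>' and then at the first '$' that follows it.
before, sep, rest = line.partition(">")   # ('TCGATC', '>', 'ion1$CCGTAG')
header, _, after = rest.partition("$")    # ('ion1', '$', 'CCGTAG')

print(header)           # ion1
print([before, after])  # ['TCGATC', 'CCGTAG']
```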

You could use split:

for line in file:
    stripped_header = line.split(">")[1].split("$")[0]

You could loop over every character in the line and append only those that come after a ">" and before the next "$" (however, this will not be nearly as efficient):

for line in file:
    # True while we are between a ">" and the following "$"
    inside_header = False
    stripped_header = ""
    for char in line:
        if char == ">":
            inside_header = True
        elif inside_header:
            if char != "$":
                stripped_header += char
            else:
                inside_header = False
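
The same character-by-character idea can be turned around to collect the pieces outside the headers, which is what the question asks for. This is a sketch under the assumption that every > is eventually closed by a $. The function name is hypothetical.

```python
def sequences_outside_headers(text):
    """Collect the text found outside '>...$' headers as a list of pieces."""
    pieces = []
    current = []
    inside_header = False
    for char in text:
        if inside_header:
            if char == "$":
                inside_header = False  # header ended, sequence resumes
        elif char == ">":
            inside_header = True
            if current:  # close the piece collected so far
                pieces.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        pieces.append("".join(current))
    return pieces

print(sequences_outside_headers("blablabla>ion1$foobar>ion2$etc"))
# ['blablabla', 'foobar', 'etc']
```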

Or, alternatively, use a regular expression, but it seems my peers have already beaten me to it!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow