Question

In my code, what I'm trying to do is clean up a FastA file by only including the letters A,C,T,G,N, and U in the output string. I'm trying to do this through a regular expression, which looks like this:

newFastA = re.findall(r'A,T,G,C,U,N', self.fastAsequence)  # trying to extract all of the listed bases from my fastAsequence
print(newFastA)

However, I am not getting all the occurrences of the bases in order. I think the format of my regular expression is incorrect, so if you could let me know what mistake I've made, that would be great.


Solution 3

You need to use a character set.

re.findall(r"[ATGCUN]", self.fastAsequence)

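Your pattern matches the literal string "A,T,G,C,U,N", commas included, and returns every occurrence of that exact string. A character set instead matches any single one of the listed characters: "any of A, T, G, C, U, N" rather than "the exact text A,T,G,C,U,N". Note that re.findall returns a list of one-character strings, so join them with "".join(...) if you want a single output string.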
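To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample string and variable names are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample input: real bases mixed with lowercase noise,
# plus one stretch that happens to contain the literal comma-separated text.
sequence = "ACGT xx NNU\nA,T,G,C,U,N"

# The original pattern only matches the exact text "A,T,G,C,U,N".
literal_matches = re.findall(r"A,T,G,C,U,N", sequence)

# The character set matches each allowed base individually, in order.
class_matches = re.findall(r"[ATGCUN]", sequence)

# findall returns a list of single characters; join them into one string.
cleaned = "".join(class_matches)

print(literal_matches)  # ['A,T,G,C,U,N']
print(cleaned)          # ACGTNNUATGCUN
```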

OTHER TIPS

print re.sub("[^ACTGNU]","",fastA_string)

to go with the million other answers you'll get

or without re

print "".join(filter(lambda character: character in set("ACTGUN"), fastA_string))
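Both one-liners above use the Python 2 print statement. A rough Python 3 sketch of the same two ideas might look like the following; the input string is a hypothetical example, and a generator expression stands in for filter with a lambda.

```python
import re

# Hypothetical input, made up for illustration.
fastA_string = "ACGT-acgt NNU 123"

# Regex version: delete every character that is NOT an allowed base.
via_sub = re.sub(r"[^ACTGNU]", "", fastA_string)

# Non-regex version: build the set once, then keep only allowed characters.
allowed = set("ACTGNU")
via_filter = "".join(ch for ch in fastA_string if ch in allowed)

print(via_sub)     # ACGTNNU
print(via_filter)  # ACGTNNU
```

Building the set once outside the loop avoids recreating it for every character, which the lambda version does.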

I'd avoid regex entirely. You can use str.translate to remove the characters you don't want.

from string import ascii_letters

removechars = ''.join(set(ascii_letters) - set('ACTGNU'))

newFastA = self.fastAsequence.translate(None, removechars)

demo:

dna = 'ACTAGAGAUACCACG this will be removed GNUGNUGNU'

dna.translate(None, removechars)
Out[6]: 'ACTAGAGAUACCACG     GNUGNUGNU'

If you want to remove whitespace too, you can toss string.whitespace into removechars.

Side note: the above only works in Python 2. In Python 3 there is an additional step, because str.translate takes a translation table that you build with str.maketrans:

from string import ascii_letters, punctuation, whitespace

#showing how to remove whitespace and punctuation too in this example
removechars = ''.join(set(ascii_letters + punctuation + whitespace) - set('ACTGNU'))

trans = str.maketrans('', '', removechars)

dna.translate(trans)
Out[11]: 'ACTAGAGAUACCACGGNUGNUGNU'
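One gap in the removal set above is that it does not contain digits, so any numbers in the input would survive. As a sketch only, the Python 3 version could be wrapped in a small helper that also strips digits; the names clean_sequence, KEEP, and REMOVE_TABLE are hypothetical, and the sample string is the demo input with "123" added.

```python
from string import ascii_letters, digits, punctuation, whitespace

# Characters to keep in the output.
KEEP = set("ACTGNU")

# Everything else among ASCII letters, digits, punctuation and whitespace
# is mapped to None, which tells translate() to delete it.
REMOVE_TABLE = str.maketrans(
    "", "", "".join(set(ascii_letters + digits + punctuation + whitespace) - KEEP)
)

def clean_sequence(raw):
    """Return raw with every listed non-base character deleted."""
    return raw.translate(REMOVE_TABLE)

dna = "ACTAGAGAUACCACG this will be removed 123 GNUGNUGNU"
cleaned = clean_sequence(dna)
print(cleaned)  # ACTAGAGAUACCACGGNUGNUGNU
```

Characters outside those ASCII groups (for example accented letters) would still pass through, so this is a blocklist rather than a strict allowlist.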
Licensed under: CC-BY-SA with attribution