DNA extraction python

https://stackoverflow.com/questions/10338616

object-tag

03-06-2021

Question

Now, I need to find a way in which Python can find the codon position number 5 of the above code and extract that sequence until position 12 (ATGG*CTTTACCTCGTC*TCACAGGAG). So the output should be something like this:

>CCODE1112_5..11
 CTTTACCTCGTC

How can I tell Python to take the begin value after the first "_" and the end value after ".." so that it does this automatically? Thanks!

Solution

def extractseq(queryseq, begin=5, end=12):
    # Split the string into lines: line 0 is the header, line 1 is the sequence
    lines = queryseq.split('\n')
    # Positions are 1-based and inclusive; a Python slice is 0-based and excludes its end index
    return lines[1][begin - 1:end]

I think this function should work. Beware that indices begin at 0 in Python.

Once that is in your script, you just have to call the function: subs = extractseq(seq, 5, 12)

OK, sorry. If you want to read the 5 and the 12 from the header line instead of passing them in, one easy way to do that is:

substring = queryseq.split('\n')[0].split('_')[1].split('..')  # header field "5..12" -> ['5', '12']
begin = int(substring[0])
end = int(substring[1])
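Putting the two pieces of this answer together might look like the sketch below. It assumes the record is a header line followed by a single sequence line, and that the header has the form shown in the question (name, underscore, begin..end). The names parse_range and extract_from_record are made up for illustration.

```python
def parse_range(header):
    """Return (begin, end) as ints from a header like '>CCODE1112_5..12'."""
    # Take the field after the first underscore, then split on the two dots.
    field = header.split('_')[1]
    begin, end = field.split('..')
    return int(begin), int(end)


def extract_from_record(record):
    """Slice the sequence line using the 1-based, inclusive range in the header."""
    # Assumption: line 0 is the header and line 1 holds the whole sequence.
    header, sequence = record.split('\n')[:2]
    begin, end = parse_range(header)
    # Subtract one from begin (0-based indexing); leave end alone (exclusive end).
    return sequence[begin - 1:end]


record = ">CCODE1112_5..12\nATGGCTTTACCTCGTCTCACAGGAG"
print(parse_range(record.split('\n')[0]))  # (5, 12)
print(extract_from_record(record))         # CTTTACCT
```

With positions 5 to 12 both included, the result is 8 bases long.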

OTHER TIPS

I'd probably (sigh) use a regex to extract 5 and 12 from CCODE1112_5..12_ABC.

Then convert the extracted strings to ints.

Then use the ints as indexes in a string slice on the DNA data.

For the regex:

>>> import re
>>> regex = re.compile(r'^[^_]*_(\d+)\.\.(\d+)_.*$')
>>> match = regex.match('CCODE1112_5..12_ABC')
>>> match.group(1)
'5'
>>> match.group(2)
'12'

To convert those to ints, use int(match.group(1)), for example.

Your indices are 1-based, while Python's are 0-based. Also, a Python slice starts at the index you give it and stops one before the end index you give it. So subtract one from group(1) and leave group(2) alone.

So something like: substring = dna_data[left_point-1:right_point]
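The steps in this answer can be strung together roughly as follows. This is only a sketch: the function name slice_by_header is hypothetical, and the pattern assumes the header looks like CCODE1112_5..12_ABC, with the trailing _ABC part made optional so that a header ending right after the range also matches.

```python
import re

# Assumed header shape: anything up to the first underscore, then begin..end,
# optionally followed by another underscore and more text.
HEADER_RE = re.compile(r'^[^_]*_(\d+)\.\.(\d+)(?:_.*)?$')


def slice_by_header(header, dna_data):
    """Return the part of dna_data named by the 1-based, inclusive range in header."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"no begin..end range found in {header!r}")
    left_point = int(match.group(1))
    right_point = int(match.group(2))
    # 1-based inclusive range -> 0-based slice with an exclusive end.
    return dna_data[left_point - 1:right_point]


print(slice_by_header('CCODE1112_5..12_ABC', 'ATGGCTTTACCTCGTCTCACAGGAG'))  # CTTTACCT
```

Raising ValueError on a header that does not match is a design choice for the sketch; returning None or the whole sequence would work too, depending on what the calling script expects.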

Licensed under: CC-BY-SA with attribution
