Question

Here is my solution to a problem from the Rosalind project.

def prot(rna):
  for i in xrange(3, (5*len(rna))//4+1, 4):
    rna=rna[:i]+','+rna[i:]
  rnaList=rna.split(',')
  bases=['U','C','A','G']
  codons = [a+b+c for a in bases for b in bases for c in bases]
  amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
  codon_table = dict(zip(codons, amino_acids))
  peptide=[]
  for i in range (len (rnaList)):
    if codon_table[rnaList[i]]=='*':
      break
    peptide+=[codon_table[rnaList[i]]]
  output=''
  for i in peptide:
    output+=str(i)
  return output

If I run prot('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'), I get the correct output 'MAMAPRTEINSTRING'. However, if the RNA sequence (the input string) is hundreds of nucleotides (characters) long, I get an error:

 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "<stdin>", line 11, in prot
 KeyError: 'CUGGAAACGCAGCCGACAUUCGCUGAAGUGUAG'

Can you point out where I went wrong?


Solution

Given that you have a KeyError, the problem must be in one of your attempts to access codon_table[rnaList[i]]. You are assuming each item in rnaList is three characters long, but evidently, at some point, that stops being true and one of the items is 'CUGGAAACGCAGCCGACAUUCGCUGAAGUGUAG'.

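This happens because each time you reassign rna = rna[:i]+','+rna[i:] you make rna one character longer, while the upper bound of xrange was computed once from the original length. Fully separated, the string needs roughly 4*len(rna)/3 characters, but your indices i stop at about 5*len(rna)/4, so they no longer reach the end of the string. This means that for any rna where len(rna) > 60, the last item in the list will not have length 3. If there is a stop codon before you reach that item it isn't a problem, but if you reach it you get the KeyError.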
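
To see the effect in isolation, here is a minimal sketch that reproduces only the comma-insertion loop from the question and reports the length of the last chunk. It assumes Python 3 (so range replaces xrange), and the helper name split_with_commas is made up for this illustration:

```python
def split_with_commas(rna):
    """Reproduce the comma-insertion loop from the question (Python 3: range, not xrange)."""
    # The stop value is computed once, from the ORIGINAL length of rna.
    for i in range(3, (5 * len(rna)) // 4 + 1, 4):
        rna = rna[:i] + ',' + rna[i:]
    return rna.split(',')

# Dummy sequences of various lengths; only the chunk sizes matter here.
for n in (51, 60, 63, 300):
    chunks = split_with_commas('A' * n)
    print(n, len(chunks[-1]))
# prints:
# 51 3
# 60 3
# 63 6
# 300 18
```

Up to 60 characters the last chunk is a proper codon; from 63 on, the tail is left unsplit and keeps growing with the input.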

I suggest you rewrite the start of your function, e.g. using the grouper recipe from itertools:

from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def prot(rna):
    rnaList = ["".join(t) for t in grouper(rna, 3)]
    ...

Note also that you can use

peptide.append(codon_table[rnaList[i]])

and

return "".join(peptide)

to simplify your code.
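
Putting these suggestions together, the whole function might look roughly like the sketch below. It is written for Python 3, where izip_longest is called zip_longest, and it assumes the input length is a multiple of 3 (with the default fillvalue of None, a trailing partial codon would make "".join fail). Everything else is taken from the question:

```python
from itertools import zip_longest  # Python 3 name for izip_longest


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def prot(rna):
    # Split into codons; assumes len(rna) is a multiple of 3.
    rna_list = ["".join(t) for t in grouper(rna, 3)]

    # Codon table built exactly as in the question.
    bases = ['U', 'C', 'A', 'G']
    codons = [a + b + c for a in bases for b in bases for c in bases]
    amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
    codon_table = dict(zip(codons, amino_acids))

    peptide = []
    for codon in rna_list:
        amino = codon_table[codon]
        if amino == '*':  # stop codon: end of translation
            break
        peptide.append(amino)
    return "".join(peptide)


print(prot('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'))
# prints: MAMAPRTEINSTRING
```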

OTHER TIPS

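This does not answer your question, but note that you could solve this very succinctly using Biopython. The version below targets older releases; Bio.Alphabet was removed in Biopython 1.78, where Seq(rna) without the alphabet argument is enough: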

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

def rna2prot(rna):
    rna = Seq(rna, IUPAC.unambiguous_rna)
    return str(rna.translate(to_stop=True))

For example:

>>> print rna2prot('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA')
MAMAPRTEINSTRING

Your code for breaking the rna into 3-char blocks is a bit nasty; you spend a lot of time breaking and rebuilding strings to no real purpose.

Building the codon_table only needs to be done once, not every time your function is run.

Here is a simplified version:

from itertools import product, takewhile

bases = "UCAG"
codons = ("".join(trio) for trio in product(bases, repeat=3))
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

def prot(rna):
    rna_codons = [rna[i:i+3] for i in range(0, len(rna) - 2, 3)]
    aminos = takewhile(
        lambda amino: amino != "*",
        (codon_table[codon] for codon in rna_codons)
    )
    return "".join(aminos)
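
One detail worth spelling out is the len(rna) - 2 bound in the slicing above: it means a trailing one- or two-character leftover is silently dropped rather than looked up in the table. A small sketch of just that step, assuming Python 3 (the helper name codon_chunks is made up for this illustration):

```python
def codon_chunks(rna):
    # Same slicing as in prot(): start indices stop before any incomplete codon.
    return [rna[i:i+3] for i in range(0, len(rna) - 2, 3)]

print(codon_chunks('AUGGCCUGA'))  # ['AUG', 'GCC', 'UGA']
print(codon_chunks('AUGGCCUG'))   # ['AUG', 'GCC']  (trailing 'UG' is dropped)
```

Whether dropping the leftover is the right behaviour depends on your input; for Rosalind data the sequence length should already be a multiple of 3.
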
Licensed under: CC-BY-SA with attribution