Question

I have the script below to count the codons listed in codon.list.csv in a gene file (test.fasta). However, it counts every occurrence of each codon irrespective of reading frame. I would like to count each codon only in frame 0 (ATG, TAT, TAT, TAA). For example:

>test1
ATGTATTATTAA

ATG:1 TAT:2 TAA:1

At the moment my script also counts TGT, ATT, TTA, etc., which I don't need.

I thought this would be easy, but I cannot get it to work.

Any advice would be great!

from Bio import  SeqIO
mRNA_sequences = "test.fasta"

in_seq_handle = open(mRNA_sequences)
seq_dict = SeqIO.to_dict(SeqIO.parse(in_seq_handle, "fasta"))
in_seq_handle.close()
seq_dict_keys =  seq_dict.keys()

dict_sequences2={} 
dict_codons = {}

contig_file = open("codon.list.csv")
for line in contig_file:
    gene_id = line[0:3]
    for sequence in seq_dict.values():
        seqstring = sequence.seq

        if dict_codons.has_key((line[:-1])):
            dict_codons[(line[:-1])] += seqstring.count(gene_id)
        else:
            dict_codons[(line[:-1])] = seqstring.count(gene_id)

print dict_codons

Solution

How about this:

a = 'ATGTATTATTAA'

codons = (a[n:n+3] for n in xrange(0,len(a),3)) # creates generator

dict_codons = {}

for codon in codons:
    if dict_codons.has_key(codon):
        dict_codons[codon] += 1
    else:
        dict_codons[codon] = 1

print dict_codons

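In short, the generator expression slices the sequence into non-overlapping 3-base chunks starting at index 0, so it yields only the frame 0 codons, and the loop tallies each one in a dictionary. Your original script used count(), which matches a codon at any offset in the sequence; that is why out-of-frame hits were included.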
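For reference, the same idea can be sketched in Python 3 with collections.Counter. This is only a minimal sketch, not the code from the answer above: the function name count_codons and its frame parameter are hypothetical, and it assumes that a trailing 1- or 2-base remainder should be ignored rather than counted as a codon.

```python
from collections import Counter


def count_codons(seq, frame=0):
    """Count non-overlapping codons in seq, starting at offset frame.

    count_codons and frame are illustrative names, not part of Biopython.
    """
    seq = str(seq).upper()
    # Stop early enough that every slice is a full 3-base codon;
    # a trailing remainder of 1 or 2 bases is skipped (assumption).
    last_start = len(seq) - 2
    return Counter(seq[i:i + 3] for i in range(frame, last_start, 3))


counts = count_codons("ATGTATTATTAA")
print(dict(counts))  # {'ATG': 1, 'TAT': 2, 'TAA': 1}
```

With Biopython, one way to use this on test.fasta would be to call count_codons(str(record.seq)) for each record returned by SeqIO.parse and add the resulting Counter objects together. To report only the codons in codon.list.csv, look each listed codon up in the combined counter.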

Licensed under: CC-BY-SA with attribution