Question

I am trying to optimize my code because loading huge dictionaries becomes really slow. I think it is because it searches for a key in the dictionary. I have been reading about Python's defaultdict and I think it might be a good improvement, but I have failed to implement it here. As you can see, it is a hierarchical dictionary structure. Any hint will be appreciated.

class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes; one gene has several proteins.'''
    def __init__(self, name):
        self.name = name  # name of the species
        self.genes = {}
    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene in self.genes:
            #Gene in the structure
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
        else:
            self.genes[gene] = Gene(gene) 
            self.updateNgenes()
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
    def updateNgenes(self):
        # Update the number of genes
        self.ngenes = len(self.genes.keys())

The definitions of Protein and Gene are:

class Protein:
    # The Protein class contains information about the length of the protein and a list with its exons (each with its own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len

class Gene:
    # The Gene class contains information about the gene and a dict with its proteins (each with its own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}
        self.updateProts()
    def updateProts(self):
        #Update number of proteins
        self.nproteins = len(self.proteins)

Solution

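A plain defaultdict does not fit here, because its default_factory is called with no arguments, while your __init__ methods require one (Gene needs the gene name). It would not address the slowdown either: the `gene in self.genes` check is already a hash lookup that takes constant time on average.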
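If defaultdict-style auto-creation is still wanted, one possible workaround is a dict subclass that defines __missing__, which receives the missing key and can pass it to the constructor. This is only a sketch: GeneDict is a hypothetical helper name, the Gene class is a cut-down stand-in for the one in the question, and the gene and protein names are invented.

```python
class Gene:
    """Cut-down stand-in for the Gene class from the question."""
    def __init__(self, name):
        self.name = name
        self.proteins = {}


class GeneDict(dict):
    """Hypothetical helper: builds a Gene the first time a key is looked up."""
    def __missing__(self, key):
        # dict.__getitem__ calls __missing__ with the key that was not found,
        # so the key can be handed to the constructor (defaultdict cannot do this).
        gene = self[key] = Gene(key)
        return gene


genes = GeneDict()
genes["geneA"].proteins["prot1"] = 120   # Gene("geneA") is created here
genes["geneA"].proteins["prot2"] = 95    # the existing Gene is reused

print(len(genes), len(genes["geneA"].proteins))
```

Note that membership tests such as `"geneB" in genes` do not trigger __missing__, so they never create entries by accident.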

This is probably one of your bottlenecks:

def updateNgenes(self):
    # Updating the number of genes
    self.ngenes = len(self.genes.keys())

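In Python 2, len(self.genes.keys()) builds a list of all keys before calculating its length. This means that every time you add a gene, you create a list and throw it away, and that list gets more expensive to build the more genes you have, so the total work grows roughly quadratically with the number of genes. To avoid the intermediate list, just use len(self.genes). In Python 3, keys() returns a view rather than a list, so the cost is much smaller, but len(self.genes) is still the simpler spelling.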
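To see how much this matters on a given interpreter, a rough timing sketch like the following can help. The dictionary size and repeat count are arbitrary choices, and the numbers will vary with the machine and the Python version (this version uses Python 3 syntax).

```python
import timeit

# Arbitrary test data: 100,000 made-up gene names.
genes = {f"gene{i}": i for i in range(100_000)}

# Both spellings give the same answer; only the cost may differ.
assert len(genes) == len(genes.keys())

t_direct = timeit.timeit(lambda: len(genes), number=10_000)
t_keys = timeit.timeit(lambda: len(genes.keys()), number=10_000)

print(f"len(genes):        {t_direct:.4f} s")
print(f"len(genes.keys()): {t_keys:.4f} s")
```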

Better yet would be to make ngenes a property so it is only calculated when you need it.

@property
def ngenes(self):
    return len(self.genes)

The same can be done with nproteins in the Gene class.

Here is your code refactored:

class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes; one gene has several proteins.'''

    def __init__(self, name):
        self.name = name  # name of the species
        self.genes = {}

    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene not in self.genes:
            self.genes[gene] = Gene(gene) 
        self.genes[gene].proteins[protname] = Protein(protname, len)

    @property
    def ngenes(self):
        return len(self.genes)

class Protein:
    # The Protein class contains information about the length of the protein and a list with its exons (each with its own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len

class Gene:
    # The Gene class contains information about the gene and a dict with its proteins (each with its own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}

    @property
    def nproteins(self):
        return len(self.proteins)
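
A quick way to sanity-check the refactored classes is to feed them a few records and read the properties back. The classes are repeated below so the snippet runs on its own. The tab-separated layout and the sample values are invented for illustration, since the question does not show the input file.

```python
class Protein:
    def __init__(self, name, len):
        self.name = name
        self.len = len


class Gene:
    def __init__(self, name):
        self.name = name
        self.proteins = {}

    @property
    def nproteins(self):
        # Computed on demand, so there is nothing to keep in sync.
        return len(self.proteins)


class Species:
    def __init__(self, name):
        self.name = name
        self.genes = {}

    def addProtein(self, gene, protname, len):
        if gene not in self.genes:
            self.genes[gene] = Gene(gene)
        self.genes[gene].proteins[protname] = Protein(protname, len)

    @property
    def ngenes(self):
        return len(self.genes)


# Hypothetical input lines: gene, protein, length (tab-separated).
lines = ["geneA\tprot1\t120", "geneA\tprot2\t95", "geneB\tprot3\t300"]

species = Species("example species")
for line in lines:
    gene, protname, length = line.split("\t")
    species.addProtein(gene, protname, int(length))

print(species.ngenes)                    # number of distinct genes
print(species.genes["geneA"].nproteins)  # proteins recorded for geneA
```

Because ngenes and nproteins are properties, they are read without parentheses and always reflect the current contents of the dictionaries.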
Licensed under: CC-BY-SA with attribution