Question

I've written a script which reads different files and searches for molecular IDs in big SDF databases (about 4.0 GB each).

The idea of this script is to copy every molecule from a list of IDs (about 287,212 molecules) out of my original databases into a new one, so that each molecule appears only once (in this case, the first copy encountered).

I've written this script:

import re
import sys
import os

def sdf_grep (molname,files):
    filin = open(files, 'r')
    filine= filin.readlines()
    for i in range(0,len(filine)):
        if filine[i][0:-1] == molname and filine[i][0:-1] not in past_mol:
            past_mol.append(filine[i][0:-1])
            iterate = 1
            while iterate == 1:
                if filine[i] == "$$$$\n":
                    filout.write(filine[i])
                    iterate = 0
                    break
                else:
                    filout.write(filine[i])
                i = i+1
        else:
            continue
    filin.close()

mol_dock = os.listdir("test")
listmol = []
past_mol = []
imp_listmol = open("consensus_sorted_surflex.txt", 'r')
filout = open('test_ini.sdf','wa')

for line in imp_listmol:
    listmol.append(line.split('\t')[0])
print 'list ready... reading files'
imp_listmol.close()

for f in mol_dock:
    print 'reading '+f
    for molecule in listmol:
        if molecule not in past_mol:
            sdf_grep(molecule , 'test/'+f) 

print len(past_mol)
filout.close()

It works perfectly, but it is very slow... too slow for what I need. Is there a way to rewrite this script so that it runs faster?

Thank you very much.

Solution

The main problem is that you have three nested loops: over the molecular files, over the molecules, and over the file contents in the innermost loop. That smells like trouble: quadratic complexity. You should move the parsing of the huge files out of the inner loop and use a set or dictionary for the molecules. Something like this:

  1. For each SDF file
  2. For each line, if it is a molecule definition
  3. Check it against the dictionary of unfound molecules
  4. If present, process it and remove it from the dictionary of unfound molecules

This way, you will parse each SDF file exactly once, and with each molecule found, the set of unfound molecules shrinks and the search gets even faster.
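A minimal sketch of that single-pass idea, assuming (as the original script does) that a record ends with a "$$$$" line and that a line matching an ID marks the start of a wanted record; copy_unique is a hypothetical helper name, not code from the question:

import os

def copy_unique(sdf_dir, wanted, filout):
    # 'wanted' holds the IDs still to be found; it shrinks as molecules
    # are written out, so membership tests only get cheaper over time.
    for fname in os.listdir(sdf_dir):
        with open(os.path.join(sdf_dir, fname)) as filin:
            copying = False
            for line in filin:
                name = line.rstrip()
                if not copying and name in wanted:
                    wanted.discard(name)   # keep only the first copy
                    copying = True
                if copying:
                    filout.write(line)
                    if name == '$$$$':     # record delimiter: stop copying
                        copying = False

# usage sketch, reusing the file names from the question:
with open('consensus_sorted_surflex.txt') as imp_listmol:
    wanted = set(line.split('\t')[0] for line in imp_listmol)
with open('test_ini.sdf', 'w') as filout:
    copy_unique('test', wanted, filout)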

OTHER TIPS

Let past_mol be a set, rather than a list. That will speed up

filine[i][0:-1] not in past_mol

since checking membership in a set is O(1), while checking membership in a list is O(n).
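For example (a tiny sketch; the ID is made up):

past_mol = set()                # instead of past_mol = []
past_mol.add('MOL0001')         # set.add instead of list.append
print 'MOL0001' in past_mol     # True; O(1) on average vs O(n) for a list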


Try not to write to a file one line at a time. Instead, save up lines in a list, join them into a single string, and then write it out with one call to filout.write.
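For instance, a sketch of that buffering pattern (the file names are placeholders):

chunk = []
with open('input.sdf') as filin, open('out.sdf', 'w') as filout:
    for line in filin:
        chunk.append(line)           # buffer the lines in memory
    filout.write(''.join(chunk))     # one write call; lines keep their '\n'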


It is generally better not to allow functions to modify global variables. sdf_grep modifies the global variable past_mol.

By adding past_mol to the arguments of sdf_grep you make it explicit that sdf_grep depends on the existence of past_mol (otherwise, sdf_grep is not really a standalone function).

If you pass past_mol in as a third argument to sdf_grep, then Python will make a new local variable named past_mol which will point to the same object as the global variable past_mol. Since that object is a set and a set is a mutable object, past_mol.add(sline) will affect the global variable past_mol as well.
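A quick illustration of that aliasing behavior (the names are made up):

def add_name(names):        # 'names' is a local alias for the caller's set
    names.add('MOL0001')    # mutates the one shared set object

past_mol = set()
add_name(past_mol)
print past_mol              # set(['MOL0001']): the caller sees the change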

As an added bonus, Python looks up local variables faster than global variables:

def using_local():
    x = set()
    for i in range(10**6):
        x  # local name lookup

y = set()
def using_global():
    for i in range(10**6):
        y  # global name lookup

In [5]: %timeit using_local()
10 loops, best of 3: 33.1 ms per loop

In [6]: %timeit using_global()
10 loops, best of 3: 41 ms per loop

sdf_grep can be simplified greatly if you use a variable (let's call it found) which keeps track of whether or not we are inside one of the chunks of lines we want to keep. (By "chunk of lines" I mean one that begins with molname and ends with "$$$$"):

import os

def sdf_grep(molname, files, past_mol):
    chunk = []
    found = False
    with open(files, 'r') as filin:
        for line in filin:
            sline = line.rstrip()
            if sline == molname and sline not in past_mol:
                found = True
                past_mol.add(sline)
            elif found and sline == '$$$$':
                # the record we are copying ends here
                chunk.append(line)
                found = False
            if found:
                chunk.append(line)
    # the lines still carry their '\n', so join with the empty string
    return ''.join(chunk)


def main():
    past_mol = set()
    with open("consensus_sorted_surflex.txt", 'r') as imp_listmol:
        listmol = [line.split('\t')[0] for line in imp_listmol]
        print 'list ready... reading files'

    with open('test_ini.sdf', 'w') as filout:
        for f in os.listdir("test"):
            print 'reading ' + f
            for molecule in listmol:
                if molecule not in past_mol:
                    filout.write(sdf_grep(molecule, os.path.join('test', f), past_mol))

        print len(past_mol)

if __name__ == '__main__':
    main()
Licensed under: CC-BY-SA with attribution