Let past_mol
be a set, rather than a list. That will speed up
filine[i][0:-1] not in past_mol
since checking membership in a set is O(1), while checking membership in a list is O(n).
Try not to write to a file one line at a time. Instead, save up lines in a list, join them into a single string, and then write it out with one call to filout.write
.
It is generally better not to allow functions to modify global variables. sdf_grep
modifies the global variable past_mol
.
By adding past_mol
to the arguments of sdf_grep
you make it explicit that sdf_grep
depends on the existence of past_mol
(otherwise, sdf_grep
is not really a standalone function).
If you pass past_mol
in as a third argument to sdf_grep
, then Python will make a new local variable named past_mol
which will point to the same object as the global variable past_mol
. Since that object is a set and a set is a mutable object, past_mol.add(sline)
will affect the global variable past_mol
as well.
As an added bonus, Python looks up local variables faster than global variables:
def using_local():
x = set()
for i in range(10**6):
x
y = set
def using_global():
for i in range(10**6):
y
In [5]: %timeit using_local()
10 loops, best of 3: 33.1 ms per loop
In [6]: %timeit using_global()
10 loops, best of 3: 41 ms per loop
sdf_grep
can be simplified greatly if you use a variable (let's call it found
) which keeps track of whether or not we are inside one of the chunks of lines we want to keep. (By "chunk of lines" I mean one that begins with molname
and ends with "$$$$"
):
import re
import sys
import os
def sdf_grep(molname, files, past_mol):
chunk = []
found = False
with open(files, 'r') as filin:
for line in filin:
sline = line.rstrip()
if sline == molname and sline not in past_mol:
found = True
past_mol.add(sline)
elif sline == '$$$$':
chunk.append(line)
found = False
if found:
chunk.append(line)
return '\n'.join(chunk)
def main():
past_mol = set()
with open("consensus_sorted_surflex.txt", 'r') as imp_listmol:
listmol = [line.split('\t')[0] for line in imp_listmol]
print 'list ready... reading files'
with open('test_ini.sdf', 'wa') as filout:
for f in os.listdir("test"):
print 'reading ' + f
for molecule in listmol:
if molecule not in past_mol:
filout.write(sdf_grep(molecule, os.path.join('test/', f), past_mol))
print len(past_mol)
if __name__ == '__main__':
main()