Question

I've written a script which reads different files and searches for molecular IDs in big SDF databases (about 4.0 GB each).

The idea of this script is to copy every molecule from a list of IDs (about 287,212 molecules) out of my original databases into a new one, so that each molecule appears only once (in this case, the first copy encountered).

I've written this script:

import re
import sys
import os

def sdf_grep (molname,files):
    filin = open(files, 'r')
    filine= filin.readlines()
    for i in range(0,len(filine)):
        if filine[i][0:-1] == molname and filine[i][0:-1] not in past_mol:
            past_mol.append(filine[i][0:-1])
            iterate = 1
            while iterate == 1:
                if filine[i] == "$$$$\n":
                    filout.write(filine[i])
                    iterate = 0
                    break
                else:
                    filout.write(filine[i])
                i = i+1
        else:
            continue
    filin.close()

mol_dock = os.listdir("test")
listmol = []
past_mol = []
imp_listmol = open("consensus_sorted_surflex.txt", 'r')
filout = open('test_ini.sdf','wa')

for line in imp_listmol:
    listmol.append(line.split('\t')[0])
print 'list ready... reading files'
imp_listmol.close()

for f in mol_dock:
    print 'reading '+f
    for molecule in listmol:
        if molecule not in past_mol:
            sdf_grep(molecule , 'test/'+f) 

print len(past_mol)
filout.close()

It works perfectly, but it is very slow... too slow for what I need. Is there a way to rewrite this script so that it runs faster?

Thank you very much.

Solution

The main problem is that you have three nested loops: over the molecular files, over the molecules, and over the file contents in the innermost loop. That smells like trouble: quadratic complexity. You should move the parsing of the huge files out of the inner loop and use a set or dictionary for the molecules. Something like this:

  1. For each SDF file
  2. For each line, if it is a molecule definition
  3. Check it against the dictionary of unfound molecules
  4. If present, process it and remove it from the dictionary of unfound molecules

This way, you will parse each SDF file exactly once, and with each molecule found, the set of unfound molecules shrinks and the search gets even faster.
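A minimal sketch of that single-pass idea, assuming (as the original script does) that a record ends with a "$$$$" line and that a line matching an ID marks the start of a wanted record; copy_unique is a hypothetical helper name, not code from the question:

import os

def copy_unique(sdf_dir, wanted, filout):
    # 'wanted' holds the IDs still to be found; it shrinks as molecules
    # are written out, so membership tests only get cheaper over time.
    for fname in os.listdir(sdf_dir):
        with open(os.path.join(sdf_dir, fname)) as filin:
            copying = False
            for line in filin:
                name = line.rstrip()
                if not copying and name in wanted:
                    wanted.discard(name)   # keep only the first copy
                    copying = True
                if copying:
                    filout.write(line)
                    if name == '$$$$':     # record delimiter: stop copying
                        copying = False

# usage sketch, reusing the file names from the question:
with open('consensus_sorted_surflex.txt') as imp_listmol:
    wanted = set(line.split('\t')[0] for line in imp_listmol)
with open('test_ini.sdf', 'w') as filout:
    copy_unique('test', wanted, filout)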

OTHER TIPS

Let past_mol be a set, rather than a list. That will speed up

filine[i][0:-1] not in past_mol

since checking membership in a set is O(1), while checking membership in a list is O(n).
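For example (a tiny sketch; the ID is made up):

past_mol = set()                # instead of past_mol = []
past_mol.add('MOL0001')         # set.add instead of list.append
print 'MOL0001' in past_mol     # True; O(1) on average vs O(n) for a list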


Try not to write to a file one line at a time. Instead, save up lines in a list, join them into a single string, and then write it out with one call to filout.write.
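For instance, a sketch of that buffering pattern (the file names are placeholders):

chunk = []
with open('input.sdf') as filin, open('out.sdf', 'w') as filout:
    for line in filin:
        chunk.append(line)           # buffer the lines in memory
    filout.write(''.join(chunk))     # one write call; lines keep their '\n'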


It is generally better not to allow functions to modify global variables. sdf_grep modifies the global variable past_mol.

By adding past_mol to the arguments of sdf_grep you make it explicit that sdf_grep depends on the existence of past_mol (otherwise, sdf_grep is not really a standalone function).

If you pass past_mol in as a third argument to sdf_grep, then Python will make a new local variable named past_mol which will point to the same object as the global variable past_mol. Since that object is a set and a set is a mutable object, past_mol.add(sline) will affect the global variable past_mol as well.
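A quick illustration of that aliasing behavior (the names are made up):

def add_name(names):        # 'names' is a local alias for the caller's set
    names.add('MOL0001')    # mutates the one shared set object

past_mol = set()
add_name(past_mol)
print past_mol              # set(['MOL0001']): the caller sees the change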

As an added bonus, Python looks up local variables faster than global variables:

def using_local():
    x = set()
    for i in range(10**6):
        x  # local name lookup

y = set()
def using_global():
    for i in range(10**6):
        y  # global name lookup

In [5]: %timeit using_local()
10 loops, best of 3: 33.1 ms per loop

In [6]: %timeit using_global()
10 loops, best of 3: 41 ms per loop

sdf_grep can be simplified greatly if you use a variable (let's call it found) which keeps track of whether or not we are inside one of the chunks of lines we want to keep. (By "chunk of lines" I mean one that begins with molname and ends with "$$$$"):

import os

def sdf_grep(molname, files, past_mol):
    chunk = []
    found = False
    with open(files, 'r') as filin:
        for line in filin:
            sline = line.rstrip()
            if sline == molname and sline not in past_mol:
                found = True
                past_mol.add(sline)
            elif found and sline == '$$$$':
                # the record we are copying ends here
                chunk.append(line)
                found = False
            if found:
                chunk.append(line)
    # the lines still carry their '\n', so join with the empty string
    return ''.join(chunk)


def main():
    past_mol = set()
    with open("consensus_sorted_surflex.txt", 'r') as imp_listmol:
        listmol = [line.split('\t')[0] for line in imp_listmol]
        print 'list ready... reading files'

    with open('test_ini.sdf', 'w') as filout:
        for f in os.listdir("test"):
            print 'reading ' + f
            for molecule in listmol:
                if molecule not in past_mol:
                    filout.write(sdf_grep(molecule, os.path.join('test', f), past_mol))

        print len(past_mol)

if __name__ == '__main__':
    main()
Licensed under: CC-BY-SA with attribution