Question

I have to check the presence of millions of elements (20-30 character strings) against a list containing 10-100k of those elements. Is there a faster way of doing that in Python than set()?

import sys

# load ids
with open(idfile) as f:
    ids = set(line.strip() for line in f)

for line in sys.stdin:
    seq_id = line.strip()
    if seq_id in ids:
        # print fastq
        print(seq_id)
        # update ids
        ids.remove(seq_id)

Solution

set is as fast as it gets: membership testing is an average-case O(1) hash lookup.

However, if you rewrite your code to create the set once and never change it, you can use the built-in frozenset type. It's exactly the same, except immutable.
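A minimal sketch of that version (assuming, as in the question, that idfile holds one id per line):

import sys

# build the lookup structure once; a frozenset cannot be modified afterwards
with open(idfile) as f:
    ids = frozenset(line.strip() for line in f)

for line in sys.stdin:
    seq_id = line.strip()
    if seq_id in ids:
        print(seq_id)  # no ids.remove() here: frozenset is immutable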

If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of CPython.

Other tips

As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from sys.stdin for membership in your 'master' set. This is going to be really, really slow, and it doesn't let you take advantage of the speed of set operations. As an example:

#!/usr/bin/env python

import random

# create two million-element sets of random numbers
a = set(random.sample(range(10000000), 1000000))
b = set(random.sample(range(10000000), 1000000))
# a intersection b
c = a & b
# a difference c
d = list(a - c)
print("set d is all remaining elements in a not common to a intersection b")
print("length of d is %s" % len(d))

The above runs in ~6 wallclock seconds on my five-year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set builds a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it will be a lot faster than reading items in one at a time and then checking each one for set membership.
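Something like this (an untested sketch, assuming your ids are already loaded into a set called ids and stdin supplies one candidate id per line):

import sys

# read every candidate up front, then intersect once at C speed
candidates = set(line.strip() for line in sys.stdin)
for match in ids & candidates:
    # note: set intersection does not preserve input order
    print(match)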

You should try to split your data to make the search faster. A tree structure would let you find very quickly whether the data is present or not.

For example, start with a simple map that links each first letter to all the keys starting with that letter; that way you don't have to search all the keys, only a smaller subset of them.

It would look like this:

import sys

ids = {}
with open(idfile) as f:
    for line in f:
        key = line.strip()
        # bucket the keys by their first character
        ids.setdefault(key[0], set()).add(key)

for line in sys.stdin:
    key = line.strip()
    if key in ids.get(key[0], set()):
        # print fastq
        print(key)
        # update ids
        ids[key[0]].remove(key)

Creation will be a bit slower, but search should be much faster (I would expect up to 20 times faster, if the first character of your keys is well distributed and not always the same).

This is just a first step; you could do the same thing with the second character and so on, and the search would then just be walking the tree with each letter, as in the sketch below...
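For illustration, a minimal, untested sketch of that full trie; the helper names here are mine, not from the original answer:

_END = object()  # sentinel key marking the end of a stored id

def trie_insert(root, word):
    # walk/create one nested dict per character
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[_END] = True

def trie_contains(root, word):
    # follow the characters; a missing branch means the id is absent
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return _END in node

root = {}
for seq_id in ("ACGT", "ACGG"):
    trie_insert(root, seq_id)
print(trie_contains(root, "ACGT"))  # True
print(trie_contains(root, "ACGA"))  # False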

As urschrei mentioned, you should "vectorize" the check: it is faster to check for the presence of a million elements once (since that happens in C) than to check one element at a time, a million times.

Licensed under: CC-BY-SA with attribution