I think you would be well advised to use a hashtable for your lookups instead of numpy.in1d, which uses an O(n log n) merge sort as a preprocessing step.
>>> A = [500, 300, 400, 200, 100]
>>> index = { k:i for i,k in enumerate(A, 1) }
>>> random_list = [200, 100, 50]
>>> [i for i,x in enumerate(random_list) if x in index]
[0, 1]
>>> map(index.get, random_list)
[4, 5, None]
>>> filter(None, map(index.get, random_list))
[4, 5]
This is Python 2. In Python 3, map and filter return lazy iterators rather than lists, so you would need to wrap the filter call in list() if you want to get the result as a list.
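For reference, here is a small sketch of the same lookup written so that it runs on Python 3. It also shows why the index is built with enumerate(A, 1): filter(None, ...) drops every falsy value, so a zero-based position 0 would silently disappear. The names positions, zero_based and lossy are only for illustration.

```python
A = [500, 300, 400, 200, 100]
random_list = [200, 100, 50]

# 1-based positions, so that no valid position is falsy
index = {k: i for i, k in enumerate(A, 1)}

# In Python 3, map and filter are lazy, so wrap the result in list()
positions = list(filter(None, map(index.get, random_list)))
print(positions)  # [4, 5]

# Pitfall: with 0-based positions, filter(None, ...) also drops position 0
zero_based = {k: i for i, k in enumerate(A)}
lossy = list(filter(None, map(zero_based.get, [500, 200])))
print(lossy)  # [3], the hit at position 0 (for 500) is lost
```

If you need zero-based positions, a list comprehension with an explicit "x in index" test avoids the problem.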
I have tried to use builtin functions as much as possible to push the computational burden to the C side (assuming you use CPython). All the names are resolved upfront, so it should be pretty fast.
In general, for maximum performance, you might want to consider using PyPy, a great alternative Python implementation with JIT compilation.
A benchmark to compare multiple approaches is never a bad idea:
import sys
is_pypy = '__pypy__' in sys.builtin_module_names

import timeit
import random
if not is_pypy:
    import numpy
import bisect

n = 10000
m = 10000
q = 100

A = set()
while len(A) < n: A.add(random.randint(0,2*n))
A = list(A)

queries = set()
while len(queries) < m: queries.add(random.randint(0,2*n))
queries = list(queries)

# these two solve question one (find indices of queries that exist in A)
if not is_pypy:
    def fun11():
        for _ in range(q):
            numpy.nonzero(numpy.in1d(queries, A))[0]

def fun12():
    index = set(A)
    for _ in range(q):
        [i for i,x in enumerate(queries) if x in index]

# these three solve question two (find according entries of B)
def fun21():
    index = { k:i for i,k in enumerate(A, 1) }
    for _ in range(q):
        [index[i] for i in queries if i in index]

def fun22():
    index = { k:i for i,k in enumerate(A, 1) }
    for _ in range(q):
        list(filter(None, map(index.get, queries)))
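
def findit(keys, values, key):
    # bisect_left returns the position of key itself when it is present;
    # bisect.bisect (= bisect_right) would return the position after it,
    # so the equality test below could never succeed.
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    return values[i]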

def fun23():
    keys, values = zip(*sorted((k,i) for i,k in enumerate(A,1)))
    for _ in range(q):
        list(filter(None, [findit(keys, values, x) for x in queries]))

if not is_pypy:
    # note this does not filter out nonexisting elements
    def fun24():
        I = numpy.argsort(A)
        ss = numpy.searchsorted
        maxi = len(I)
        for _ in range(q):
            a = ss(A, queries, sorter=I)
            I[a[a<maxi]]

tests = ("fun12", "fun21", "fun22", "fun23")
if not is_pypy: tests = ("fun11",) + tests + ("fun24",)

if is_pypy:
    # warmup
    for f in tests:
        timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=20)

# actual timing
for f in tests:
    print("%s: %.3f" % (f, timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=3)))
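
Before trusting any timings, it is worth checking that the competing approaches actually return the same thing. Here is a minimal sketch of such a check on the small example from the top of this answer. The helper names lookup_listcomp, lookup_filter_map and lookup_bisect are made up for illustration and are not part of the benchmark; the binary-search variant uses bisect.bisect_left, so that an existing key is found at the returned position.

```python
import bisect

def lookup_listcomp(index, queries):
    # dict lookup inside a list comprehension
    return [index[x] for x in queries if x in index]

def lookup_filter_map(index, queries):
    # dict lookup via builtins; relies on all stored positions being truthy
    return list(filter(None, map(index.get, queries)))

def lookup_bisect(keys, values, queries):
    # binary search over sorted keys; values[i] is the position of keys[i]
    found = []
    for x in queries:
        i = bisect.bisect_left(keys, x)
        if i < len(keys) and keys[i] == x:
            found.append(values[i])
    return found

A = [500, 300, 400, 200, 100]
random_list = [200, 100, 50]

index = {k: i for i, k in enumerate(A, 1)}
keys, values = zip(*sorted((k, i) for i, k in enumerate(A, 1)))

results = [
    lookup_listcomp(index, random_list),
    lookup_filter_map(index, random_list),
    lookup_bisect(keys, values, random_list),
]
print(results)  # [[4, 5], [4, 5], [4, 5]]
```

The same comparison can be run on the randomly generated A and queries from the benchmark if you want more confidence than this toy input gives.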
Results:
$ python2 -V
Python 2.7.6
$ python3 -V
Python 3.3.3
$ pypy -V
Python 2.7.3 (87aa9de10f9ca71da9ab4a3d53e0ba176b67d086, Dec 04 2013, 12:50:47)
[PyPy 2.2.1 with GCC 4.8.2]
$ python2 test.py
fun11: 1.016
fun12: 0.349
fun21: 0.302
fun22: 0.276
fun23: 2.432
fun24: 0.897
$ python3 test.py
fun11: 0.973
fun12: 0.382
fun21: 0.423
fun22: 0.341
fun23: 3.650
fun24: 0.894
$ pypy ~/tmp/test.py
fun12: 0.087
fun21: 0.073
fun22: 0.128
fun23: 1.131
You can tweak n (size of A), m (size of random_list, called queries in the benchmark) and q (number of queries, i.e. how many times the whole lookup is repeated) to your scenario. To my surprise, my attempt to be clever and use builtin functions instead of list comps has not paid off, since fun22 is not a lot faster than fun21 (only ~10% in Python 2 and ~25% in Python 3, but almost 75% slower in PyPy). A case of premature optimization here. I guess the difference is due to the fact that fun22 builds up an unnecessary temporary list per query in Python 2. We also see that binary search is pretty bad (look at fun23).