Question

I'm currently working on a high-performance Python 2.7 project using lists tens of thousands of elements in size. Obviously, every operation must be performed as fast as possible.

So, I have two lists: one of them is a list of unique arbitrary numbers, let's call it A, and the other one is a linear list starting with 1 and with the same length as the first list, named B, which represents indices in A (starting with 1).

Something like enumerate, starting with 1.

For example:

A = [500, 300, 400, 200, 100] # The order here is arbitrary, they can be any integers, but every integer can only exist once
B = [  1,   2,   3,   4,   5] # This is fixed, starting from 1, with exactly as many elements as A

If I have an element of B (called e_B) and want the corresponding element in A, I can simply do correspond_e_A = A[e_B - 1]. No problem.

But now I have a huge list of random, non-unique integers, and I want to know the indices of the integers that are in A, and what the corresponding elements in B are.
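
For example, with the A and B above and a hypothetical query list:

random_list = [200, 100, 50]
# indices of elements that exist in A: [0, 1] (200 and 100 are in A, 50 is not)
# corresponding elements of B:         [4, 5] (200 is the 4th element of A, 100 the 5th)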

I think I have a reasonable solution for the first question:

indices_of_existing = numpy.nonzero(numpy.in1d(random_list, A))[0]

What is great about this approach is that there is no need to map() a function over single elements; numpy's in1d simply returns an array like [True, True, False, True, ...]. Using nonzero() I can get the indices of the elements in random_list that exist in A. Perfect, I think.
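
With the example values above, the intermediate steps look like this:

numpy.in1d(random_list, A)                    # array([ True,  True, False], dtype=bool)
numpy.nonzero(numpy.in1d(random_list, A))[0]  # array([0, 1])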

But for the second question, I'm stumped. I tried something like:

corresponding_e_B = map(lambda x: numpy.where(A==x)[0][0] + 1, random_list)

This correctly gives me the indices, but it's not optimal: firstly I need a map(), secondly I need a lambda, and finally numpy.where() does not stop after the item has been found once (remember, A only contains unique elements), meaning that it scales horribly with huge datasets like mine.

I took a look at bisect, but it seems bisect only works on single values, not on whole lists, meaning that I'd still have to use map() and build my result list element by element (and that's slow, isn't it?)

Since I'm quite new to Python, I was hoping anyone here might have an idea? Maybe a library I don't know yet?


Solution

I think you would be well advised to use a hash table for your lookups instead of numpy.in1d, which uses an O(n log n) merge sort as a preprocessing step.

>>> A = [500, 300, 400, 200, 100]
>>> index = { k:i for i,k in enumerate(A, 1) }
>>> random_list = [200, 100, 50]
>>> [i for i,x in enumerate(random_list) if x in index]
[0, 1]
>>> map(index.get, random_list)
[4, 5, None]
>>> filter(None, map(index.get, random_list))
[4, 5]

This is Python 2; in Python 3, map and filter return lazy iterator-like objects, so you would need to wrap the filter call in list() if you want the result as a list. Note that filter(None, ...) drops all falsy values; this is safe here only because the indices in B start at 1, so 0 can never appear.
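
For example, the Python 3 equivalent of the last expression would be:

>>> list(filter(None, map(index.get, random_list)))
[4, 5]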

I have tried to use builtin functions as much as possible to push the computational burden to the C side (assuming you use CPython). All the names are resolved upfront, so it should be pretty fast.
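
Put together, a small helper along these lines (the function name is just for illustration) answers both questions at once:

def find_matches(A, random_list):
  # build the value -> 1-based index mapping once
  index = { k:i for i,k in enumerate(A, 1) }
  # question one: positions in random_list whose value exists in A
  indices_of_existing = [i for i,x in enumerate(random_list) if x in index]
  # question two: the corresponding elements of B
  corresponding_e_B = [index[x] for x in random_list if x in index]
  return indices_of_existing, corresponding_e_B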

In general, for maximum performance, you might want to consider using PyPy, a great alternative Python implementation with JIT compilation.

A benchmark to compare multiple approaches is never a bad idea:

import sys
is_pypy = '__pypy__' in sys.builtin_module_names

import timeit
import random
if not is_pypy:
  import numpy
import bisect

n = 10000
m = 10000
q = 100

A = set()
while len(A) < n: A.add(random.randint(0,2*n))
A = list(A)

queries = set()
while len(queries) < m: queries.add(random.randint(0,2*n))
queries = list(queries)

# these two solve question one (find indices of queries that exist in A)
if not is_pypy:
  def fun11():
    for _ in range(q):
      numpy.nonzero(numpy.in1d(queries, A))[0]

def fun12():
  index = set(A)
  for _ in range(q):
    [i for i,x in enumerate(queries) if x in index]

# these three solve question two (find according entries of B)
def fun21():
  index = { k:i for i,k in enumerate(A, 1) }
  for _ in range(q):
    [index[i] for i in queries if i in index]

def fun22():
  index = { k:i for i,k in enumerate(A, 1) }
  for _ in range(q):
    list(filter(None, map(index.get, queries)))

def findit(keys, values, key):
  # bisect_left gives the leftmost insertion point, so keys[i] == key
  # exactly when key is present (plain bisect would point past the match)
  i = bisect.bisect_left(keys, key)
  if i == len(keys) or keys[i] != key:
    return None
  return values[i]

def fun23():
  keys, values = zip(*sorted((k,i) for i,k in enumerate(A,1)))
  for _ in range(q):
    list(filter(None, [findit(keys, values, x) for x in queries]))

if not is_pypy:
  # note this does not filter out nonexisting elements
  def fun24():
    I = numpy.argsort(A)
    ss = numpy.searchsorted
    maxi = len(I)
    for _ in range(q):   
      a = ss(A, queries, sorter=I)
      I[a[a<maxi]]

tests = ("fun12", "fun21", "fun22", "fun23")
if not is_pypy: tests = ("fun11",) + tests + ("fun24",)

if is_pypy:
  # warmup
  for f in tests:
    timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=20)

# actual timing
for f in tests:
  print("%s: %.3f" % (f, timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=3)))

Results:

$ python2 -V
Python 2.7.6
$ python3 -V
Python 3.3.3
$ pypy -V
Python 2.7.3 (87aa9de10f9ca71da9ab4a3d53e0ba176b67d086, Dec 04 2013, 12:50:47)
[PyPy 2.2.1 with GCC 4.8.2]
$ python2 test.py
fun11: 1.016
fun12: 0.349
fun21: 0.302
fun22: 0.276
fun23: 2.432
fun24: 0.897
$ python3 test.py
fun11: 0.973
fun12: 0.382
fun21: 0.423
fun22: 0.341
fun23: 3.650
fun24: 0.894
$ pypy ~/tmp/test.py
fun12: 0.087
fun21: 0.073
fun22: 0.128
fun23: 1.131

You can tweak n (size of A), m (size of random_list) and q (number of queries) to match your scenario. To my surprise, my attempt to be clever and use builtin functions instead of list comprehensions has not paid off, since fun22 is not a lot faster than fun21 (only ~10% in Python 2 and ~25% in Python 3, but almost 75% slower in PyPy). A case of premature optimization here. I guess the difference is due to the fact that fun22 builds up an unnecessary temporary list per query in Python 2. We also see that binary search is pretty bad (look at fun23).

OTHER TIPS

import numpy as np

def numpy_optimized(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    return I[Q]

This calculates the same thing as the OP, but with the indices in an order matching the values queried, which I imagine is an improvement in functionality. It is up to twice as fast as the OP's solution on my machine, which puts it slightly ahead of the non-PyPy solutions, if I interpret your benchmarks correctly.
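
For example, with the arrays from the question (assuming numpy has been imported as np):

A = np.array([500, 300, 400, 200, 100])
queries = np.array([200, 100])   # all present in A
numpy_optimized(queries, A)      # array([3, 4]); add 1 to get the elements of B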

Or, in case we cannot assume that all entries of index are present in values and we would like missing queries to be silently dropped:

def numpy_optimized_filtered(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    # queries larger than values.max() yield an insertion point one past
    # the end of I; clip so the indexing below cannot go out of bounds
    Q = np.minimum(Q, len(I) - 1)
    Z = I[Q]
    return Z[values[Z] == index]
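
With a query that does not occur in A, such as 50, the filtered version silently drops it:

A = np.array([500, 300, 400, 200, 100])
queries = np.array([200, 100, 50])
numpy_optimized_filtered(queries, A)   # array([3, 4])
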
Licensed under: CC-BY-SA with attribution