Question

I'm currently working on a high-performance Python 2.7 project using lists tens of thousands of elements in size. Obviously, every operation must be performed as fast as possible.

So, I have two lists: one of them is a list of unique arbitrary numbers, let's call it A, and the other one is a linear list starting with 1 and with the same length as the first list, named B, which represents indices in A (starting with 1).

Something like enumerate, starting with 1.

For example:

A = [500, 300, 400, 200, 100] # The order here is arbitrary, they can be any integers, but every integer can only exist once
B = [  1,   2,   3,   4,   5] # This is fixed, starting from 1, with exactly as many elements as A

If I have an element of B (called e_B) and want the corresponding element in A, I can simply do correspond_e_A = A[e_B - 1]. No problem.

But now I have a huge list of random, non-unique integers, and I want to know the indices of the integers that are in A, and what the corresponding elements in B are.
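
For example, with the A and B above and a hypothetical query list:

random_list = [200, 100, 50]
# indices of elements that exist in A: [0, 1] (200 and 100 are in A, 50 is not)
# corresponding elements of B:         [4, 5] (200 is the 4th element of A, 100 the 5th)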

I think I have a reasonable solution for the first question:

indices_of_existing = numpy.nonzero(numpy.in1d(random_list, A))[0]

What is great about this approach is that there is no need to map() a function over single elements; numpy's in1d simply returns an array like [True, True, False, True, ...]. Using nonzero() I can get the indices of the elements in random_list that exist in A. Perfect, I think.
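
With the example values above, the intermediate steps look like this:

numpy.in1d(random_list, A)                    # array([ True,  True, False], dtype=bool)
numpy.nonzero(numpy.in1d(random_list, A))[0]  # array([0, 1])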

But for the second question, I'm stumped. I tried something like:

corresponding_e_B = map(lambda x: numpy.where(A==x)[0][0] + 1, random_list)

This correctly gives me the indices, but it's not optimal: firstly I need a map(), secondly I need a lambda, and finally numpy.where() does not stop after the item has been found once (remember, A only contains unique elements), meaning that it scales horribly with huge datasets like mine.

I took a look at bisect, but it seems bisect only works on single values, not on whole lists, meaning that I'd still have to use map() and build my result list element by element (and that's slow, isn't it?)

Since I'm quite new to Python, I was hoping anyone here might have an idea? Maybe a library I don't know yet?


Solution

I think you would be well advised to use a hash table for your lookups instead of numpy.in1d, which uses an O(n log n) merge sort as a preprocessing step.

>>> A = [500, 300, 400, 200, 100]
>>> index = { k:i for i,k in enumerate(A, 1) }
>>> random_list = [200, 100, 50]
>>> [i for i,x in enumerate(random_list) if x in index]
[0, 1]
>>> map(index.get, random_list)
[4, 5, None]
>>> filter(None, map(index.get, random_list))
[4, 5]

This is Python 2; in Python 3, map and filter return lazy iterator-like objects, so you would need to wrap the filter call in list() if you want the result as a list. Note that filter(None, ...) drops all falsy values; this is safe here only because the indices in B start at 1, so 0 can never appear.
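
For example, the Python 3 equivalent of the last expression would be:

>>> list(filter(None, map(index.get, random_list)))
[4, 5]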

I have tried to use builtin functions as much as possible to push the computational burden to the C side (assuming you use CPython). All the names are resolved upfront, so it should be pretty fast.
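
Put together, a small helper along these lines (the function name is just for illustration) answers both questions at once:

def find_matches(A, random_list):
  # build the value -> 1-based index mapping once
  index = { k:i for i,k in enumerate(A, 1) }
  # question one: positions in random_list whose value exists in A
  indices_of_existing = [i for i,x in enumerate(random_list) if x in index]
  # question two: the corresponding elements of B
  corresponding_e_B = [index[x] for x in random_list if x in index]
  return indices_of_existing, corresponding_e_B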

In general, for maximum performance, you might want to consider using PyPy, a great alternative Python implementation with JIT compilation.

A benchmark to compare multiple approaches is never a bad idea:

import sys
is_pypy = '__pypy__' in sys.builtin_module_names

import timeit
import random
if not is_pypy:
  import numpy
import bisect

n = 10000
m = 10000
q = 100

A = set()
while len(A) < n: A.add(random.randint(0,2*n))
A = list(A)

queries = set()
while len(queries) < m: queries.add(random.randint(0,2*n))
queries = list(queries)

# these two solve question one (find indices of queries that exist in A)
if not is_pypy:
  def fun11():
    for _ in range(q):
      numpy.nonzero(numpy.in1d(queries, A))[0]

def fun12():
  index = set(A)
  for _ in range(q):
    [i for i,x in enumerate(queries) if x in index]

# these three solve question two (find according entries of B)
def fun21():
  index = { k:i for i,k in enumerate(A, 1) }
  for _ in range(q):
    [index[i] for i in queries if i in index]

def fun22():
  index = { k:i for i,k in enumerate(A, 1) }
  for _ in range(q):
    list(filter(None, map(index.get, queries)))

def findit(keys, values, key):
  # bisect_left gives the leftmost insertion point, so keys[i] == key
  # exactly when key is present (plain bisect would point past the match)
  i = bisect.bisect_left(keys, key)
  if i == len(keys) or keys[i] != key:
    return None
  return values[i]

def fun23():
  keys, values = zip(*sorted((k,i) for i,k in enumerate(A,1)))
  for _ in range(q):
    list(filter(None, [findit(keys, values, x) for x in queries]))

if not is_pypy:
  # note this does not filter out nonexisting elements
  def fun24():
    I = numpy.argsort(A)
    ss = numpy.searchsorted
    maxi = len(I)
    for _ in range(q):   
      a = ss(A, queries, sorter=I)
      I[a[a<maxi]]

tests = ("fun12", "fun21", "fun22", "fun23")
if not is_pypy: tests = ("fun11",) + tests + ("fun24",)

if is_pypy:
  # warmup
  for f in tests:
    timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=20)

# actual timing
for f in tests:
  print("%s: %.3f" % (f, timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=3)))

Results:

$ python2 -V
Python 2.7.6
$ python3 -V
Python 3.3.3
$ pypy -V
Python 2.7.3 (87aa9de10f9ca71da9ab4a3d53e0ba176b67d086, Dec 04 2013, 12:50:47)
[PyPy 2.2.1 with GCC 4.8.2]
$ python2 test.py
fun11: 1.016
fun12: 0.349
fun21: 0.302
fun22: 0.276
fun23: 2.432
fun24: 0.897
$ python3 test.py
fun11: 0.973
fun12: 0.382
fun21: 0.423
fun22: 0.341
fun23: 3.650
fun24: 0.894
$ pypy ~/tmp/test.py
fun12: 0.087
fun21: 0.073
fun22: 0.128
fun23: 1.131

You can tweak n (size of A), m (size of random_list) and q (number of queries) to match your scenario. To my surprise, my attempt to be clever and use builtin functions instead of list comprehensions has not paid off, since fun22 is not a lot faster than fun21 (only ~10% in Python 2 and ~25% in Python 3, but almost 75% slower in PyPy). A case of premature optimization here. I guess the difference is due to the fact that fun22 builds up an unnecessary temporary list per query in Python 2. We also see that binary search is pretty bad (look at fun23).

OTHER TIPS

import numpy as np

def numpy_optimized(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    return I[Q]

This calculates the same thing as the OP, but with the indices in an order matching the values queried, which I imagine is an improvement in functionality. It is up to twice as fast as the OP's solution on my machine, which puts it slightly ahead of the non-PyPy solutions, if I interpret your benchmarks correctly.
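
For example, with the arrays from the question (assuming numpy has been imported as np):

A = np.array([500, 300, 400, 200, 100])
queries = np.array([200, 100])   # all present in A
numpy_optimized(queries, A)      # array([3, 4]); add 1 to get the elements of B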

Or, in case we cannot assume that all entries of index are present in values and we would like missing queries to be silently dropped:

def numpy_optimized_filtered(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    # queries larger than values.max() yield an insertion point one past
    # the end of I; clip so the indexing below cannot go out of bounds
    Q = np.minimum(Q, len(I) - 1)
    Z = I[Q]
    return Z[values[Z] == index]
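
With a query that does not occur in A, such as 50, the filtered version silently drops it:

A = np.array([500, 300, 400, 200, 100])
queries = np.array([200, 100, 50])
numpy_optimized_filtered(queries, A)   # array([3, 4])
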
Licensed under: CC-BY-SA with attribution