Question

I'm looking for a very efficient way to calculate all possible combinations from n lists and then keep the combination with the smallest least squares difference.

I already have code that does it, but when it gets to several million combinations things get slow.

candidates_len contains a list of lists of lengths, e.g. [[500, 490, 510, 600], [300, 490, 520], [305, 497, 515]], and candidates_name contains a list of lists of names, e.g. [['a', 'b', 'c', 'd'], ['mi', 'mu', 'ma'], ['pi', 'pu', 'pa']].

Both contain n sublists.

# Create the possible combinations and store the lists of lengths in r
r = [[]]
for x in candidates_len:
    r = [i + [y] for y in x for i in r]
# Create the matching combinations of names and store them in z
z = [[]]
for x in candidates_name:
    z = [i + [y] for y in x for i in z]
# Calculate the sum of squared pairwise differences and keep the minimum
min_index = 0
min_dist = 0
list_best = []
for index, item in enumerate(r):
    n = 0
    dist = 0
    while n < len(candidates_len):
        for i in range(n, len(candidates_len)):
            dist = dist + (item[n] - item[i])**2
        n = n + 1
    if index == 0:
        min_dist = dist
        min_index = index
        list_best = [z[index]]
    elif dist < min_dist:
        min_dist = dist
        min_index = index
        list_best = [z[index]]
least_combination = min_index

A hard case: http://pastebin.com/BkVQTQWK

Here are some test times. It would be good to get under a minute or so, though I don't know if that's possible.

combinations    time (s)
       77760    1.255663
       41184    1.580333
       69120    6.214786
        8960    1.131834
      537600   14.855361
       89100    1.264126
       16384    3.247404
     4199040  666.853284
      226800    3.935878
     9289728  679.064149

Solution

My first thought here is that you're spending an awful lot of time building up lists that you don't need. At the very least, scrapping them would make things a lot simpler, but no guarantees it would actually make things faster:

import itertools

r = itertools.product(*candidates_len)
z = itertools.product(*candidates_name)

min_dist = None
for item, names in zip(r, z):
    dist = 0
    for n in range(len(item)):
        for i in range(n, len(item)):
            dist += (item[n] - item[i])**2
    if min_dist is None or dist < min_dist:
        min_dist = dist
        best = item, names

print(best)

With your test data, the explicit lists are taking up gigabytes of memory. I'm not sure how much—my poor 4GB laptop went into swap thrash hell before it even finished generating the z list, and everything slowed to a crawl. The whole operation takes less time with itertools than just the setup part without it… That probably isn't true on a machine with, say, 16GB of RAM, but still, why use the memory if you don't need it?

My next thought is that all you're doing is calculating the LSD (least-squares difference) on a bunch of arrays. Do you have a huge array of tiny arrays? If so, can you de-jag them (e.g., by padding them out with None) and numpy the whole thing? On the other hand, if it's an array of big arrays, you probably want a list (or, as above, an iterator) of numpy arrays, so at least you can vectorize one dimension.
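
For the first case, here is a minimal sketch of the padding idea, using NaN as the filler rather than None so the result stays a plain numeric array (the input is just the small example from the question):

import numpy as np

# Jagged input, as in the question
candidates_len = [[500, 490, 510, 600], [300, 490, 520], [305, 497, 515]]

# Pad every row out to the width of the longest one; NaN marks "missing"
width = max(len(row) for row in candidates_len)
padded = np.full((len(candidates_len), width), np.nan)
for i, row in enumerate(candidates_len):
    padded[i, :len(row)] = row

print(padded)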

Either way, vectorization is the key to optimizing anything that involves simple operations over big arrays, and numpy will often do vectorization better than anything an expert C++-and-Fortran-and-platform-specific-assembly coder is likely to do manually.

Without either thinking through the code or trying to understand the algorithm at a deep level, my first attempt would be to generate r as a sequence (as in my code above), but of numpy row vectors (something like matrix(x, dtype=int) for x in itertools.product(*candidates_len)). Then you can calculate the differences on each item with just item - item.T, and sum up the squares of the lower triangle (which I'd have to look up how to do).

You could probably improve the performance further by figuring out a way to only calculate the lower triangle in the first place. The typical trick is to get the lower-triangle sum onto the diagonal as part of the vectorized operation and then just extract the diagonal, but that isn't always appropriate. See the broadcasting docs for some ideas on how to vectorize the inner loop without creating explicit matrices.

Finally, see if there's a way you can build a 3D array out of the whole thing (which may mean padding the individual items out to a fixed width) and vectorize the entire operation. (The memory use won't be nearly as bad, since numpy only needs to allocate 4 bytes for each value instead of a whole PyObject… but it may still be bad enough that you lose more than you gain.) Sorry if this is all a little vague, but hopefully it's enough to get you started on the experimentation.
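
Here is a rough sketch of that first attempt, using plain 1-D arrays and broadcasting rather than matrix, with the small candidate lists from the question standing in for the real data:

import itertools
import numpy as np

candidates_len = [[500, 490, 510, 600], [300, 490, 520], [305, 497, 515]]
candidates_name = [['a', 'b', 'c', 'd'], ['mi', 'mu', 'ma'], ['pi', 'pu', 'pa']]

best = None
min_dist = None
for item, names in zip(itertools.product(*candidates_len),
                       itertools.product(*candidates_name)):
    v = np.array(item, dtype=np.int64)
    # Broadcasting gives the full matrix of pairwise differences: diffs[n, i] = v[n] - v[i]
    diffs = v[:, None] - v[None, :]
    # The matrix is antisymmetric, so the squared strict lower triangle sums to the
    # same value as the pairs the original double loop adds up
    dist = int((np.tril(diffs, k=-1) ** 2).sum())
    if min_dist is None or dist < min_dist:
        min_dist = dist
        best = item, names

print(best, min_dist)

For rows this short, the per-item numpy overhead may well eat the gains, which is part of why the padded, whole-array 3D version is the variant more likely to pay off.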

Another thought is that you can probably parallelize this. I'm willing to bet that any machine with enough memory to handle that massive list has at least 4 cores on it, and you've got a long sequence of completely independent operations, which is the easiest thing in the world to parallelize.

As a first step, create a multiprocessing.Pool and make the outer loop submit jobs to the pool instead of doing the work directly. You'll probably find that the jobs are too small, so you're drowning in overhead, but then you can always batch up every N items (either explicitly, or see the grouper recipe in the itertools docs) and make each job "loop over these N items and return the one with the minimum LSD". (It may take some tweaking to find the optimal N.) You can even combine this with the top-level numpy approach, by splitting the giant array into chunks along the x axis and farming those out as jobs.
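
Here is a rough sketch of the batched version; best_of_chunk and chunked are made-up names, chunked is just a simplified stand-in for the grouper recipe, and the chunk size of 10000 is an arbitrary starting point you would tune:

import itertools
from multiprocessing import Pool

def best_of_chunk(chunk):
    # chunk is a list of (lengths, names) pairs; return (dist, lengths, names)
    # for the pair with the smallest sum of squared pairwise differences
    best = None
    for item, names in chunk:
        dist = sum((item[n] - item[i]) ** 2
                   for n in range(len(item))
                   for i in range(n + 1, len(item)))
        if best is None or dist < best[0]:
            best = (dist, item, names)
    return best

def chunked(iterable, size):
    # Yield successive lists of up to `size` items
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

if __name__ == '__main__':
    candidates_len = [[500, 490, 510, 600], [300, 490, 520], [305, 497, 515]]
    candidates_name = [['a', 'b', 'c', 'd'], ['mi', 'mu', 'ma'], ['pi', 'pu', 'pa']]
    pairs = zip(itertools.product(*candidates_len),
                itertools.product(*candidates_name))
    with Pool() as pool:
        results = pool.map(best_of_chunk, chunked(pairs, 10000))
    print(min(results))

The pool pickles each chunk on its way to the workers, so there is some serialization cost; bigger chunks amortize it, which is the tweaking of N mentioned above.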

One more thought: your algorithm starts by building the product of N lists of roughly M candidates each, which has on the order of M^N elements, each of length N. Then, for each element, you loop over it twice, which is O(N^2), so the best possible performance is going to be O(N^2 * M^N) (for example, 8 lists of 7 candidates each already means roughly 5.8 million combinations). Is this really the right algorithm? If so, are you actually getting that performance out of your implementation? If the answer to either question is no, you shouldn't be trying to micro-optimize it yet. Only when you've actually got the most efficient algorithm, coded properly, is it worth doing things like vectorizing, avoiding excess work, moving tight loops into C++ and Fortran, etc. Otherwise, you're just going to come back and say "But it still blows up when I go to 4x as big as my last test run".

OTHER TIPS

The first thing I would do is put as much of this as you can into Numpy arrays. Array-based operations in Numpy are executed at more or less C speed, and it looks like much of this could be done in Numpy to begin with...

If that doesn't get your blood flowing, then I would profile the code and write a Cython function for the bottleneck. Assuming you can put static types on the lists/arrays, Cython is probably going to be your best bet if you want to stay in the Pythonic world. I've personally seen 100x speedups for some bottlenecks using Cython.
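
For the profiling step, something like this is a reasonable start; find_best_combination here is only a hypothetical stand-in for your own search loop:

import cProfile
import itertools
import pstats
import random

# Hypothetical stand-in for the real search, just so the profile run is self-contained
def find_best_combination(candidates_len):
    best = None
    for item in itertools.product(*candidates_len):
        dist = sum((item[n] - item[i]) ** 2
                   for n in range(len(item))
                   for i in range(n + 1, len(item)))
        if best is None or dist < best[0]:
            best = (dist, item)
    return best

candidates_len = [[random.randint(300, 600) for _ in range(10)] for _ in range(5)]
cProfile.run('find_best_combination(candidates_len)', 'search.prof')
pstats.Stats('search.prof').sort_stats('cumulative').print_stats(10)  # top 10 hotspots by cumulative time

Whatever dominates that output is the function worth moving into Cython with static types.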

Here's an example of an image convolution with Cython from their docs.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow