Question

I have a set of N objects, and I'd like to compute an NxN distance matrix. Sometimes my set of N objects is very large, and I'd like to compute an approximation to the full NxN distance matrix by only computing a subset of the distance comparisons.

Can anyone point me in the direction of something that calculates approximations to a full distance matrix? I have some ideas in mind, but I'd like to avoid re-inventing the wheel.

Edit: An example of the type of algorithm would take advantage of the fact that if there is a very small distance between object A and object B, and there is a very small distance between object B and object C, there has to be a somewhat short distance between objects A and C.
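Concretely, the property being described is the triangle inequality; a minimal sketch of the interval it gives you for an unmeasured pair (the function name is just illustrative):

```python
def triangle_bounds(d_ab, d_bc):
    """Given exact d(A, B) and d(B, C), the triangle inequality pins the
    unmeasured d(A, C) into an interval:
        |d(A, B) - d(B, C)| <= d(A, C) <= d(A, B) + d(B, C)
    """
    return abs(d_ab - d_bc), d_ab + d_bc
```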


Solution

Honestly, I think it depends on how close you want your approximation to be and how big your subset is. If you just want an overall feel for what the matrix will look like, you can do simple linear interpolation on a random subset (including the maximal and minimal nodes) and get reasonably accurate results.


I think the real trick here is figuring out the heuristic (linear, quadratic, etc. interpolation) and the subset size. You could also compute the distance matrices of various subsets and then interpolate between those matrices with some method (linear, spherical linear, cubic); a minimal sketch of the subset step is below.
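Here is one hedged sketch of the "distance matrix of a random subset" idea (the function name and interface are mine, not the answerer's); how you interpolate from the sampled entries to the rest is the heuristic part:

```python
import random

def subset_distance_matrix(objects, dist, sample_size, seed=None):
    """Exact distance matrix for a random subset of the objects.
    Gives an overall feel for the full matrix at a fraction of the cost."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(objects)), sample_size)
    m = [[0.0] * sample_size for _ in range(sample_size)]
    for a in range(sample_size):
        for b in range(a + 1, sample_size):
            # Only sample_size * (sample_size - 1) / 2 distance calls.
            d = dist(objects[idx[a]], objects[idx[b]])
            m[a][b] = m[b][a] = d
    return idx, m
```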

Depending on your initial sample, it's pretty much heuristic trial and error until you go "oh, that's good enough for what I need".

OTHER TIPS

Are your "objects" on a network? If the objects are in a network, you can use this or this that yields the all-pairs shortest paths. If not, you're pretty much stuck with calculated all the n x n distances, I think.

The solution you require is similar to what we commonly see in graphs: you can use an all-pairs shortest-path algorithm to find the distances. You can also look at Johnson's algorithm.
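To make the graph suggestion concrete, here is a hedged sketch using SciPy's csgraph module (the input format and function name are my own): it treats the distances you did measure as graph edges and runs Johnson's algorithm over them. The resulting path lengths are upper bounds on the true distances; pairs with no connecting chain of measurements come back as infinity.

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def upper_bounds_from_measured_pairs(n, measured):
    """measured: iterable of (i, j, dist) for the pairs actually compared.
    Shortest paths over this partial graph give upper bounds on every
    pairwise distance; in a metric space the true distance can only be
    smaller or equal (triangle inequality)."""
    rows, cols, vals = zip(*measured)
    graph = csr_matrix((vals, (rows, cols)), shape=(n, n))
    # method='J' selects Johnson's algorithm; method='D' (Dijkstra) also works.
    return shortest_path(graph, method='J', directed=False)
```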

I had this same question and ended up writing Python code for it:

https://github.com/jpeterbaker/lazyDistance

README.md explains how the triangle inequality can be used to update upper and lower bounds for each distance.

Just run the Python file as a script for an example in 2-dimensional space. The plotted lines are the only distances that were actually calculated.

In my version, the time savings aren't about having a large number of objects. As I've written it, it's an O(n^4) algorithm, so it's actually worse than just calculating all distances if the number of objects is large. But my method will save time when you have a modest number of objects and the distance function is very expensive to calculate. It assumes that it is faster to do several O(n^2) operations than to compute a single distance.
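For flavour, here is a minimal sketch of the kind of bound propagation described above (this is my own paraphrase of the idea, not the repository's code; `upper` starts at infinity and `lower` at zero off the diagonal):

```python
import numpy as np

def tighten_bounds(upper, lower, i, j, d):
    """One sweep of triangle-inequality tightening after measuring
    d = dist(i, j) exactly.  `upper` and `lower` are N x N arrays of the
    current upper/lower bounds (exactly known entries have upper == lower)."""
    upper = upper.copy()
    lower = lower.copy()
    upper[i, j] = upper[j, i] = d
    lower[i, j] = lower[j, i] = d
    # Upper bounds: d(a, b) <= d(a, i) + d(i, j) + d(j, b)
    upper = np.minimum(upper, upper[:, [i]] + d + upper[[j], :])
    upper = np.minimum(upper, upper[:, [j]] + d + upper[[i], :])
    # Lower bounds: d(a, b) >= d(a, i) - d(i, b) >= lower[a, i] - upper[i, b]
    lower = np.maximum(lower, lower[:, [i]] - upper[[i], :])
    lower = np.maximum(lower, lower[:, [j]] - upper[[j], :])
    # Symmetrize; every entry is still a valid bound on the same pair.
    upper = np.minimum(upper, upper.T)
    lower = np.maximum(lower, lower.T)
    return upper, lower
```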

If n is large, you could look for cheaper methods of deciding which distance to calculate next (ones that don't involve arithmetic with the n^2 entries of the distance-bounds matrices). You also may not need to update all 2*n^2 bounds every time, the way this code does.

Licensed under: CC-BY-SA with attribution