Question

I have a list of points with floating-point coordinates, and I've computed the squared Euclidean distances between them. I haven't computed the actual Euclidean distances because taking a square root is an expensive operation. So, I have a list of floating-point squares {a², b², ...}.

My goal is to find the arithmetic mean of the actual Euclidean distances, (a + b + ...) / n.

Is there a way to avoid computing a square root for every element?

Solution

If you don't need to know the exact answer, you should read this paper:

http://www2.mta.ac.il/~adish/Pubs/Papers/av-metric-r3.pdf (archive.org link to prevent link rot)

First, it's clear from the paper that there is no way to get an exact answer faster than the brute-force approach:

Our aim is beating the obvious algorithm that computes the exact value of the aforementioned average (by considering all pairs of points). But, unlike in the graph theoretic setting (cf. [4]), we cannot hope for approximation algorithms that run in time that is sub-linear in the number of points (because a single “exceptional” point may dominate the value of the average of all pairwise distances). Thus, we seek approximation algorithms that run in time that is almost linear in the number of points. We consider two algorithmic approaches.

Then they give an easy answer: rather than computing the Euclidean distance for all pairs of points, you can get a sqrt(d) approximation by averaging the per-coordinate distances instead (unfortunately, Programmers.SE doesn't have MathJax, so a screenshot will have to do):

[Screenshot of the paper's formula: average, over all pairs of points, the sum over the d coordinates of the absolute coordinate differences, i.e. the average pairwise L1 (Manhattan) distance.]

tl;dr in plain language: the formula just says to add up the absolute differences between the coordinates instead of computing the exact distances. For example, sqrt((x2-x1)^2 + (y2-y1)^2) becomes |x2-x1| + |y2-y1|, and your answer will be off by at most a factor of sqrt(d), which in this two-dimensional case is sqrt(2).
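
For concreteness, here is a minimal Python sketch of that L1 shortcut (representing each point as a tuple of coordinates is an assumption, not something from the question):

    import itertools

    def avg_distance_l1(points):
        """Average the Manhattan (per-coordinate) distances over all pairs.
        No square roots are needed; the result overestimates the true average
        Euclidean distance by at most a factor of sqrt(d)."""
        pairs = list(itertools.combinations(points, 2))
        total = 0.0
        for p, q in pairs:
            total += sum(abs(pc - qc) for pc, qc in zip(p, q))
        return total / len(pairs)

    # Example with 2D points, so the estimate is within a factor of sqrt(2).
    points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    print(avg_distance_l1(points))  # 9.33..., vs. the true average of 6.66...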

Then, the paper goes on to discuss a random sampling algorithm, which is more accurate.

Random sampling

I recommend reading the paper to see why this works; they explain it better than I can.
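
This is not the paper's algorithm verbatim, just a minimal sketch of the general sampling idea; the sample count is an arbitrary assumption:

    import math
    import random

    def avg_distance_sampled(points, num_samples=1000, seed=None):
        """Estimate the average pairwise Euclidean distance from a random
        sample of pairs instead of all n*(n-1)/2 of them."""
        rng = random.Random(seed)
        n = len(points)
        total = 0.0
        for _ in range(num_samples):
            i, j = rng.sample(range(n), 2)  # two distinct random indices
            total += math.dist(points[i], points[j])
        return total / num_samples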

OTHER TIPS

1. Change the data representation

Don't store the squares; store the square roots (the distances themselves) when inserting them, since computing a*a later is much cheaper than computing sqrt(a*a).
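
A minimal sketch of that idea, assuming you control the code that inserts the values (the class and method names are made up for illustration):

    import math

    class DistanceStore:
        """Store the distance itself once at insertion time; re-squaring it
        later with d * d is much cheaper than repeated sqrt calls."""
        def __init__(self):
            self.distances = []

        def insert_squared(self, squared_value):
            # One sqrt per value, paid only once when the value enters the store.
            self.distances.append(math.sqrt(squared_value))

        def mean_distance(self):
            return sum(self.distances) / len(self.distances)

        def squared(self, index):
            d = self.distances[index]
            return d * d  # cheap multiplication whenever the square is needed again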

2. Fixed cache

This assumes the values are integers.

If you know there are many duplicates, e.g. squares between 1*1 and 1000*1000, then caching the roots in a hash map might speed up the computation.
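
A minimal sketch, assuming, as the tip does, that the distances are integers in a known range (here 1 to 1000) so their squares can be precomputed:

    import math

    # Precompute the root of every perfect square in a known, small integer range.
    SQRT_CACHE = {i * i: float(i) for i in range(1, 1001)}

    def cached_sqrt(squared_value):
        """Look the root up in the fixed cache; fall back to math.sqrt otherwise."""
        cached = SQRT_CACHE.get(squared_value)
        return cached if cached is not None else math.sqrt(squared_value)

    def mean_distance(squared_values):
        return sum(cached_sqrt(s) for s in squared_values) / len(squared_values)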

3. LRU cache

If you know that there are many duplicates in it, then an LRU cache might help you.
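
In Python this is a one-decorator change with functools.lru_cache; a minimal sketch (the maxsize is an arbitrary assumption, and it only pays off if exact duplicate values really occur):

    import functools
    import math

    @functools.lru_cache(maxsize=65536)
    def lru_sqrt(squared_value):
        """Return sqrt(squared_value), caching the most recently used results
        so exact duplicate inputs skip the sqrt entirely."""
        return math.sqrt(squared_value)

    def mean_distance(squared_values):
        return sum(lru_sqrt(s) for s in squared_values) / len(squared_values)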

4. Approximation

Instead of calling sqrt, you could implement it yourself (e.g. Newton-Raphson) and run only a few iterations.
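
A minimal sketch using Newton-Raphson, seeded from the floating-point exponent; the iteration count is an assumption, and whether this actually beats your platform's sqrt is something you would have to measure:

    import math

    def approx_sqrt(x, iterations=3):
        """Approximate sqrt(x) with a few Newton-Raphson steps:
        guess -> (guess + x / guess) / 2."""
        if x == 0.0:
            return 0.0
        # Seed the iteration near the right order of magnitude using the
        # binary exponent of x, so a handful of steps is enough.
        mantissa, exponent = math.frexp(x)       # x = mantissa * 2**exponent
        guess = math.ldexp(1.0, exponent // 2)   # roughly 2**(exponent/2)
        for _ in range(iterations):
            guess = 0.5 * (guess + x / guess)
        return guess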

It depends on how exact your average needs to be. If there is a large disparity between the sizes of the values, you can "ignore" the smaller ones, skip their square roots, and still get a good estimate of the average:

(1000 + 1000 + 1000 + 0.001) / 4 = 750.00025

Ignore the last value:

(1000 + 1000 + 1000 + 0) / 4 = 750 
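
A minimal sketch of that shortcut; the threshold is a made-up parameter you would tune to your data:

    import math

    def mean_distance_approx(squared_values, squared_threshold=1e-4):
        """Skip the sqrt for values whose square is below the threshold and
        count their contribution as zero; each skipped term changes the sum
        by less than sqrt(squared_threshold)."""
        total = 0.0
        for s in squared_values:
            if s > squared_threshold:
                total += math.sqrt(s)
        return total / len(squared_values)

    # The example above: three large values dominate, the tiny one is dropped.
    print(mean_distance_approx([1000.0**2, 1000.0**2, 1000.0**2, 0.001**2]))  # 750.0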

Is there a way to avoid computing a square root for every element?

Absolutely. Raise the value to the power of ½.

Great. Is this faster?

It is a well-trodden road, but unfortunately I have yet to find any language, platform, or CPU where the default implementation of raising to the power of ½ is more efficient than sqrt.

Various algorithms exist if you want to roll your own.

Aitch proposed using a cache if you have integers, but you stated you have floats. A cache is, however, still possible if:

  • the points whose distance you are calculating are on a grid (so the set of likely values is small)

- or -

  • the resulting value does not need high accuracy (in which case you may use intervals as keys in your cache).

In the second case, if you keep pairs (a²; b²) → (a; b) with a and b close together, you can also get a lower and upper bound on the average, if that's useful to the client. (It would be helpful to mention what that average is needed for.)
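
A minimal sketch of such an interval-keyed cache; the bucketing scheme (rounding the squared value to a fixed number of decimals) is an assumption:

    import math

    class IntervalSqrtCache:
        """Cache square roots keyed by a bucketed (rounded) squared value,
        storing a lower and upper bound taken from the bucket edges."""
        def __init__(self, decimals=2):
            self.decimals = decimals
            self.half_bucket = 0.5 * 10 ** (-decimals)
            self.cache = {}

        def sqrt_bounds(self, squared_value):
            key = round(squared_value, self.decimals)
            if key not in self.cache:
                lo = math.sqrt(max(key - self.half_bucket, 0.0))
                hi = math.sqrt(key + self.half_bucket)
                self.cache[key] = (lo, hi)
            return self.cache[key]

    def mean_bounds(squared_values, cache):
        """Lower and upper bounds on the mean distance using the bucketed cache."""
        bounds = [cache.sqrt_bounds(s) for s in squared_values]
        n = len(bounds)
        return (sum(lo for lo, _ in bounds) / n,
                sum(hi for _, hi in bounds) / n)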

Licensed under: CC-BY-SA with attribution