How can the Euclidean distance be calculated with NumPy?
05-07-2019
Question
I have two points in 3D:
(xa, ya, za)
(xb, yb, zb)
And I want to calculate the distance:
dist = sqrt((xa-xb)^2 + (ya-yb)^2 + (za-zb)^2)
What's the best way to do this with NumPy, or with Python in general? I have:
a = numpy.array((xa ,ya, za))
b = numpy.array((xb, yb, zb))
Solution
Use numpy.linalg.norm:
dist = numpy.linalg.norm(a-b)
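As a quick sanity check, here is a minimal sketch with made-up coordinates (the values are illustrative, not from the question) that compares the result with the formula written out by hand:

```python
import numpy

# Illustrative coordinates (assumed values, not from the question).
a = numpy.array((1.0, 2.0, 3.0))
b = numpy.array((4.0, 6.0, 3.0))

# Euclidean (L2) norm of the difference vector.
dist = numpy.linalg.norm(a - b)

# The same formula written out term by term.
by_hand = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2) ** 0.5

print(dist)  # 5.0
```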
OTHER TIPS
There's a function for that in SciPy. It's called euclidean.
Example:
from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)
dst = distance.euclidean(a, b)
For anyone interested in computing multiple distances at once, I've done a little comparison using perfplot (a small project of mine). It turns out that
a_min_b = a - b
numpy.sqrt(numpy.einsum('ij,ij->i', a_min_b, a_min_b))
computes the distances of the rows in a and b fastest. This actually holds true for just one row as well!
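To make the einsum subscripts less mysterious, here is a small sketch on two hand-picked arrays (illustrative values only). It checks that the einsum expression gives the same per-row distances as numpy.linalg.norm with axis=1; it says nothing about speed.

```python
import numpy

# Two small sets of 3D points, one point per row (illustrative values).
a = numpy.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
b = numpy.array([[3.0, 4.0, 0.0], [1.0, 2.0, 5.0]])

a_min_b = a - b
# 'ij,ij->i' multiplies the two arrays element-wise and sums over j
# (the coordinates), leaving one squared distance per row i.
via_einsum = numpy.sqrt(numpy.einsum('ij,ij->i', a_min_b, a_min_b))
via_norm = numpy.linalg.norm(a_min_b, axis=1)

print(via_einsum)  # [5. 2.]
```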
Code to reproduce the plot:
import matplotlib
import numpy
import perfplot
from scipy.spatial import distance

def linalg_norm(data):
    a, b = data
    return numpy.linalg.norm(a-b, axis=1)

def sqrt_sum(data):
    a, b = data
    return numpy.sqrt(numpy.sum((a-b)**2, axis=1))

def scipy_distance(data):
    a, b = data
    return list(map(distance.euclidean, a, b))

def mpl_dist(data):
    a, b = data
    return list(map(matplotlib.mlab.dist, a, b))

def sqrt_einsum(data):
    a, b = data
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum('ij,ij->i', a_min_b, a_min_b))

perfplot.show(
    setup=lambda n: numpy.random.rand(2, n, 3),
    n_range=[2**k for k in range(20)],
    kernels=[linalg_norm, scipy_distance, mpl_dist, sqrt_sum, sqrt_einsum],
    logx=True,
    logy=True,
    xlabel='len(x), len(y)'
)
Another instance of this problem-solving method:
def dist(x,y):
    return numpy.sqrt(numpy.sum((x-y)**2))
a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))
dist_a_b = dist(a,b)
I want to expound on the simple answer with various performance notes. np.linalg.norm will do perhaps more than you need:
dist = numpy.linalg.norm(a-b)
Firstly - this function is designed to work over a list and return all of the values, e.g. to compare the distance from pA to the set of points sP:
sP = np.array(points)  # shape (n, 3): one point per row
pA = np.array(point)
distances = np.linalg.norm(sP - pA, ord=2, axis=1)  # 'distances' is an array, one entry per row
Remember several things:
- Python function calls are expensive.
- [Regular] Python doesn't cache name lookups.
So
def distance(pointA, pointB):
    dist = np.linalg.norm(pointA - pointB)
    return dist
isn't as innocent as it looks.
>>> dis.dis(distance)
  2           0 LOAD_GLOBAL              0 (np)
              2 LOAD_ATTR                1 (linalg)
              4 LOAD_ATTR                2 (norm)
              6 LOAD_FAST                0 (pointA)
              8 LOAD_FAST                1 (pointB)
             10 BINARY_SUBTRACT
             12 CALL_FUNCTION            1
             14 STORE_FAST               2 (dist)

  3          16 LOAD_FAST                2 (dist)
             18 RETURN_VALUE
Firstly - every time we call it, we have to do a global lookup for "np", a scoped lookup for "linalg" and a scoped lookup for "norm", and the overhead of merely calling the function can equate to dozens of Python instructions.
Lastly, we wasted two operations to store the result and reload it for the return...
First pass at improvement: make the lookup faster, skip the store
def distance(pointA, pointB, _norm=np.linalg.norm):
    return _norm(pointA - pointB)
We get the far more streamlined:
>>> dis.dis(distance)
  2           0 LOAD_FAST                2 (_norm)
              2 LOAD_FAST                0 (pointA)
              4 LOAD_FAST                1 (pointB)
              6 BINARY_SUBTRACT
              8 CALL_FUNCTION            1
             10 RETURN_VALUE
The function call overhead still amounts to some work, though. And you'll want to do benchmarks to determine whether you might be better doing the math yourself:
def distance(pointA, pointB):
    return (
        ((pointA.x - pointB.x) ** 2) +
        ((pointA.y - pointB.y) ** 2) +
        ((pointA.z - pointB.z) ** 2)
    ) ** 0.5  # fast sqrt
On some platforms, **0.5 is faster than math.sqrt. Your mileage may vary.
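If you want to check this on your own machine, a rough timeit sketch along these lines might help. The Point namedtuple and the two function names are hypothetical stand-ins for whatever point type you use; the timings it prints are only meaningful for your interpreter and hardware.

```python
import math
import timeit
from collections import namedtuple

# Hypothetical point type with .x, .y, .z attributes, as the snippet above assumes.
Point = namedtuple("Point", "x y z")

def distance_pow(p, q):
    # Square root via the ** operator.
    return ((p.x - q.x) ** 2 + (p.y - q.y) ** 2 + (p.z - q.z) ** 2) ** 0.5

def distance_sqrt(p, q):
    # Square root via math.sqrt.
    return math.sqrt((p.x - q.x) ** 2 + (p.y - q.y) ** 2 + (p.z - q.z) ** 2)

p, q = Point(1.0, 2.0, 3.0), Point(4.0, 6.0, 3.0)

# Both variants compute the same distance; only the timing differs.
for fn in (distance_pow, distance_sqrt):
    seconds = timeit.timeit(lambda: fn(p, q), number=100_000)
    print(f"{fn.__name__}: {fn(p, q)} in {seconds:.3f}s")
```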
**** Advanced performance notes.
Why are you calculating distance? If the sole purpose is to display it,
print("The target is %.2fm away" % (distance(a, b)))
move along. But if you're comparing distances, doing range checks, etc., I'd like to add some useful performance observations.
Let’s take two cases: sorting by distance or culling a list to items that meet a range constraint.
# Ultra naive implementations. Hold onto your hat.
def sort_things_by_distance(origin, things):
    # list.sort() sorts in place and returns None, so return the list itself.
    things.sort(key=lambda thing: distance(origin, thing))
    return things

def in_range(origin, range, things):
    things_in_range = []
    for thing in things:
        if distance(origin, thing) <= range:
            things_in_range.append(thing)
    return things_in_range
The first thing we need to remember is that we are using Pythagoras to calculate the distance (dist = sqrt(x^2 + y^2 + z^2)), so we're making a lot of sqrt calls. Math 101:
dist = sqrt(x^2 + y^2 + z^2)
therefore
dist^2 = x^2 + y^2 + z^2
and, for non-negative N and M,
sq(N) < sq(M) iff N < M
and
sq(N) > sq(M) iff N > M
and
sq(N) = sq(M) iff N == M
In short: until we actually require the distance in a unit of X rather than X^2, we can eliminate the hardest part of the calculations.
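A small check of that claim, as a sketch: sorting by squared distance gives the same order as sorting by the rooted distance. Point is a hypothetical namedtuple and the coordinates are made up for illustration.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y z")  # hypothetical point type

def distance_sq(p, q):
    # Squared Euclidean distance: no square root involved.
    return (p.x - q.x) ** 2 + (p.y - q.y) ** 2 + (p.z - q.z) ** 2

origin = Point(0, 0, 0)
things = [Point(3, 4, 0), Point(1, 0, 0), Point(0, 2, 2)]

# Squaring preserves order for non-negative values, so both keys agree.
by_sq = sorted(things, key=lambda t: distance_sq(origin, t))
by_dist = sorted(things, key=lambda t: distance_sq(origin, t) ** 0.5)
print(by_sq == by_dist)  # True
```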
# Still naive, but much faster.
def distance_sq(left, right):
    """ Returns the square of the distance between left and right. """
    return (
        ((left.x - right.x) ** 2) +
        ((left.y - right.y) ** 2) +
        ((left.z - right.z) ** 2)
    )

def sort_things_by_distance(origin, things):
    # list.sort() sorts in place and returns None, so return the list itself.
    things.sort(key=lambda thing: distance_sq(origin, thing))
    return things

def in_range(origin, range, things):
    things_in_range = []
    # Remember that sqrt(N)**2 == N, so if we square
    # range, we don't need to root the distances.
    range_sq = range**2
    for thing in things:
        if distance_sq(origin, thing) <= range_sq:
            things_in_range.append(thing)
    return things_in_range
Great, both functions no longer do any expensive square roots. That'll be much faster. We can also improve in_range by converting it to a generator:
def in_range(origin, range, things):
    range_sq = range**2
    yield from (thing for thing in things
                if distance_sq(origin, thing) <= range_sq)
This especially has benefits if you are doing something like:
if any(in_range(origin, max_dist, things)):
    ...
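To see why the generator helps here, this sketch records which points actually get a distance computed. any() stops at the first hit, so the remaining points are never touched. Point, the evaluated list and the max_dist parameter name are assumptions made for the illustration.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y z")  # hypothetical point type
evaluated = []  # records every point whose squared distance was computed

def distance_sq(p, q):
    evaluated.append(q)
    return (p.x - q.x) ** 2 + (p.y - q.y) ** 2 + (p.z - q.z) ** 2

def in_range(origin, max_dist, things):
    range_sq = max_dist ** 2
    yield from (thing for thing in things
                if distance_sq(origin, thing) <= range_sq)

origin = Point(0, 0, 0)
things = [Point(9, 9, 9), Point(1, 0, 0), Point(2, 0, 0), Point(3, 0, 0)]

# any() consumes the generator lazily and stops at the first match.
found = any(in_range(origin, 5, things))
print(found, len(evaluated))  # True 2
```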
But if the very next thing you are going to do requires a distance,
for nearby in in_range(origin, walking_distance, hotdog_stands):
    print("%s %.2fm" % (nearby.name, distance(origin, nearby)))
consider yielding tuples:
def in_range_with_dist_sq(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = distance_sq(origin, thing)
        if dist_sq <= range_sq:
            yield (thing, dist_sq)
This can be especially useful if you might chain range checks ('find things that are near X and within Nm of Y'), since you don't have to calculate the distance again.
But what about if we're searching a really large list of things
and we anticipate a lot of them not being worth consideration?
There is actually a very simple optimization:
def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = (origin.x - thing.x) ** 2
        if dist_sq <= range_sq:
            dist_sq += (origin.y - thing.y) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing
Whether this is useful will depend on the size of 'things'.
def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    if len(things) >= 4096:
        for thing in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2
                if dist_sq <= range_sq:
                    dist_sq += (origin.z - thing.z) ** 2
                    if dist_sq <= range_sq:
                        yield thing
    elif len(things) > 32:
        for thing in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2 + (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing
    else:
        # Small input: just calculate the squared distance and range-check it.
        for thing in things:
            if distance_sq(origin, thing) <= range_sq:
                yield thing
And again, consider yielding the dist_sq. Our hotdog example then becomes:
# Chaining generators
info = in_range_with_dist_sq(origin, walking_distance, hotdog_stands)
# The tuple needs its own parentheses inside a generator expression.
info = ((stand, dist_sq**0.5) for stand, dist_sq in info)
for stand, dist in info:
    print("%s %.2fm" % (stand, dist))
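Putting the pieces together, a self-contained version of the hotdog example might look like the sketch below. The Stand and Point namedtuples, the sample coordinates and the max_dist parameter name are all made up for illustration; the answer itself only assumes objects with .x, .y and .z attributes.

```python
from collections import namedtuple

# Hypothetical record types for the illustration.
Point = namedtuple("Point", "x y z")
Stand = namedtuple("Stand", "name x y z")

def distance_sq(left, right):
    """Returns the square of the distance between left and right."""
    return ((left.x - right.x) ** 2 +
            (left.y - right.y) ** 2 +
            (left.z - right.z) ** 2)

def in_range_with_dist_sq(origin, max_dist, things):
    range_sq = max_dist ** 2
    for thing in things:
        dist_sq = distance_sq(origin, thing)
        if dist_sq <= range_sq:
            yield (thing, dist_sq)

origin = Point(0, 0, 0)
hotdog_stands = [Stand("A", 3, 4, 0), Stand("B", 30, 40, 0), Stand("C", 0, 0, 2)]

# Chain the generators; the square root is taken only for stands in range.
info = in_range_with_dist_sq(origin, 10, hotdog_stands)
info = ((stand, dist_sq ** 0.5) for stand, dist_sq in info)
report = ["%s %.2fm" % (stand.name, dist) for stand, dist in info]
print(report)  # ['A 5.00m', 'C 2.00m']
```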
I found a 'dist' function in matplotlib.mlab, but I don't think it's handy enough.
I'm posting it here just for reference.
import numpy as np
from matplotlib import mlab  # mlab.dist is only available in older Matplotlib releases
a = np.array([1, 2, 3])
b = np.array([2, 3, 4])
# Distance between a and b
dis = mlab.dist(a, b)
It can be done like the following. I don't know how fast it is, but it's not using NumPy.
from math import sqrt
a = (1, 2, 3) # Data point 1
b = (4, 5, 6) # Data point 2
print(sqrt(sum((ai - bi)**2 for ai, bi in zip(a, b))))
You can just subtract the vectors and then take the inner product.
Following your example,
a = numpy.array((xa, ya, za))
b = numpy.array((xb, yb, zb))
tmp = a - b
sum_squared = numpy.dot(tmp.T, tmp)
result = numpy.sqrt(sum_squared)
It is simple code and is easy to understand.
I like np.dot (dot product):
a = np.array((xa, ya, za))
b = np.array((xb, yb, zb))
distance = np.dot(a - b, a - b) ** .5
Starting with Python 3.8, the math module directly provides the dist function, which returns the Euclidean distance between two points (given as a tuple of coordinates):
from math import dist
dist((1, 2, 6), (-2, 3, 2)) # 5.0990195135927845
If you're working with lists instead of tuples:
dist(tuple([1, 2, 6]), tuple([-2, 3, 2]))
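As a quick cross-check (a sketch that assumes Python 3.8 or newer), math.dist can be compared against the formula written out with math.sqrt. The comparison uses math.isclose rather than == because the two computations may differ in the last bit.

```python
import math

p = (1, 2, 6)
q = (-2, 3, 2)

# Written-out formula, for comparison with math.dist (Python 3.8+).
by_hand = math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(math.isclose(math.dist(p, q), by_hand))  # True
```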
Having a and b as you defined them, you can also use:
distance = np.sqrt(np.sum((a-b)**2))
A nice one-liner:
dist = numpy.linalg.norm(a-b)
However, if speed is a concern I would recommend experimenting on your machine. I found that using the math library's sqrt with the ** operator for the square is much faster on my machine than the one-liner NumPy solution.
I ran my tests using this simple program:
#!/usr/bin/python
import math
import numpy
from random import uniform

def fastest_calc_dist(p1,p2):
    return math.sqrt((p2[0] - p1[0]) ** 2 +
                     (p2[1] - p1[1]) ** 2 +
                     (p2[2] - p1[2]) ** 2)

def math_calc_dist(p1,p2):
    return math.sqrt(math.pow((p2[0] - p1[0]), 2) +
                     math.pow((p2[1] - p1[1]), 2) +
                     math.pow((p2[2] - p1[2]), 2))

def numpy_calc_dist(p1,p2):
    return numpy.linalg.norm(numpy.array(p1)-numpy.array(p2))

TOTAL_LOCATIONS = 1000
p1 = dict()
p2 = dict()
for i in range(0, TOTAL_LOCATIONS):
    p1[i] = (uniform(0,1000),uniform(0,1000),uniform(0,1000))
    p2[i] = (uniform(0,1000),uniform(0,1000),uniform(0,1000))

total_dist = 0
for i in range(0, TOTAL_LOCATIONS):
    for j in range(0, TOTAL_LOCATIONS):
        dist = fastest_calc_dist(p1[i], p2[j]) #change this line for testing
        total_dist += dist

print(total_dist)
On my machine, math_calc_dist runs much faster than numpy_calc_dist: 1.5 seconds versus 23.5 seconds.
To get a measurable difference between fastest_calc_dist and math_calc_dist I had to up TOTAL_LOCATIONS to 6000. Then fastest_calc_dist takes ~50 seconds while math_calc_dist takes ~60 seconds.
You can also experiment with numpy.sqrt and numpy.square, though both were slower than the math alternatives on my machine.
My tests were run with Python 2.6.6.
Here's some concise Python code for the Euclidean distance between two points represented as lists.
def distance(v1,v2):
    return sum([(x-y)**2 for (x,y) in zip(v1,v2)])**(0.5)
import numpy as np
from scipy.spatial import distance
input_arr = np.array([[0,3,0],[2,0,0],[0,1,3],[0,1,2],[-1,0,1],[1,1,1]])
test_case = np.array([0,0,0])
dst=[]
for i in range(0,6):
    temp = distance.euclidean(test_case,input_arr[i])
    dst.append(temp)
print(dst)
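The explicit loop above can also be written with NumPy broadcasting alone. This is a sketch of an alternative, not what the snippet above does: it swaps scipy's distance.euclidean for numpy.linalg.norm with axis=1 and returns an array instead of a list.

```python
import numpy as np

input_arr = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3], [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
test_case = np.array([0, 0, 0])

# Broadcasting subtracts test_case from every row;
# axis=1 then takes the Euclidean norm of each row separately.
dst = np.linalg.norm(input_arr - test_case, axis=1)
print(dst)
```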
import math
dist = math.hypot(math.hypot(xa-xb, ya-yb), za-zb)
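On Python 3.8 and newer, math.hypot also accepts more than two arguments, so the nesting is optional there. A small sketch with made-up coordinates comparing the two forms:

```python
import math

# Illustrative coordinates (assumed values).
xa, ya, za = 1.0, 2.0, 3.0
xb, yb, zb = 4.0, 6.0, 3.0

# Nested two-argument form, works on older Python versions too.
nested = math.hypot(math.hypot(xa - xb, ya - yb), za - zb)
# Three-argument form, needs Python 3.8+.
direct = math.hypot(xa - xb, ya - yb, za - zb)

print(nested, direct)  # 5.0 5.0
```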
You can easily use the formula
distance = np.sqrt(np.sum(np.square(a-b)))
which does actually nothing more than using Pythagoras' theorem to calculate the distance, by adding the squares of Δx, Δy and Δz and rooting the result.
Calculate the Euclidean distance for multidimensional space:
import math
x = [1, 2, 6]
y = [-2, 3, 2]
dist = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(x, y)]))
5.0990195135927845
First, find the difference of the two matrices. Then apply element-wise multiplication with NumPy's multiply command. After that, sum the elements of the resulting matrix. Finally, take the square root of the sum.
import numpy as np

def findEuclideanDistance(a, b):
    euclidean_distance = a - b
    euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
    euclidean_distance = np.sqrt(euclidean_distance)
    return euclidean_distance