Best Performance-Critical Algorithm for Solving Nearest Neighbor

https://stackoverflow.com/questions/1639755

08-07-2019
|

Question

We have a list of x,y pairs. Every pair represents a point on a 2D space. I want to find the closest point from this list, to a specific point xq,yq. What is the best performance-critical algorithm for this problem? Lisp of points is not going to change; which means I do not need to perform insertion and deletion. I want just to find the nearest neighbor of a target xq,yq point in this set.

Edit 1: Thanks to all! As Stephan202 has guessed correctly, I want to do this repeatedly; like a function. An the list is not necessarily sorted (In fact I do not understand how can it be sorted? Like a table with a primary key of 2 columns a and y? If that helps then I will sort it).

I will construct the data structure based on the list one time, and then I will use this generated data structure in the function (if this process itself is relevant).

Thank you Jacob; It seems that KD-Tree data structure is a good candidate for being the answer (And I feel it is. I will update when I get some relevant results).

Edit 2: I have found that, this problem is named "nearest neighbor"!

Edit 3: First title was "In Search of an Algorithm (for Spatial-Querying and Spatial-Indexing) (Nearest Neighbor)"; I have chose a new title: "Best Performance-Critical Algorithm for Solving Nearest Neighbor". Since I do not want to perform insertion and deletion operation on my initial data and I want just the nearest one from them to a new point (which is not going to be inserted), I chose to (currently) working on KD-Trees. Thanks to all!

Solution

As Stephan202 noted, if you plan to find the closest-match for more than one point, you should use a tree.

I would recommend a KD-tree, whose implementation can be easily found in several packages like OpenCV 2.0. Or you could implement one yourself!

EDIT: I had asked a question about kd-tree implementations here - might be useful.

EDIT: KD-trees have been widely used successfully for NN searches :) - Also, if you're willing to accept approximate matches, you can use Fast Library for Approximate Nearest Neigbor (FLANN). The FLANN implementation is present in OpenCV 2.0.

If you don't want approximate answers you can tweak the FLANN parameters to search the entire tree.

OTHER TIPS

If the query point (xq, yq) varies and the list doesn't, you need to calculate the Voronoi diagram of the list of points. This will give you a set of polygons or "cells" (some of which are infinite); each polygon corresponds to a point from the original list, called the "site" of that cell. Any point which lies entirely inside of one polygon is closer to the site of that polygon than it is to the other sites on the original list. Any point on a boundary between two polygons lies equally distant from each site.

Once you've gotten that far, you need an easy way to figure out which polygon you're in. This is known as the point location problem.

A really, really good book for this kind of thing is Computational Geometry: Algorithms and Applications. They discuss both the Voronoi diagram calculation and the trapezoidal slab method of point location in detail.

If you don't want to do the code yourself, and you shouldn't, then try to get a library like CGAL that will do most of the work for you. This probably applies to the KD-tree answer as well, but I don't specifically know.

You need a spatial index.

If you roll your own, you can do a lot worse than picking the R-Tree or Quad-tree algorithms.

I would go with a quadtree. It is the most simple spatial structure. In 2 dimensions i would generally recommend quadtree instead of kd-tree, because it is simpler, faster. Its downside is more memory consumption if the number of dimensions is high, but in case of 2 dimensions the difference is not significant.

There is a nice optimization trick if your coordinates are floating point typed: In a query you first will have to find the leaf-node that contains the point to which the most near point is asked. To do this you will have to go in the tree from the root to the leaf - in each iteration deciding which child-node to step on. Store the identifiers/addresses of the child-nodes in a 4-sized array in the Node structure. Digitize the point coordinates in the query algorithm. Then you will be able to find the proper sub-node just by indexing the array by 2 proper bits of the digitized point coordinates. Digitizing is fast: implement it with a simple static_cast.

But first implement the quadtree without optimization because it is easy to make a bug with the bit-operations. Even without this optimization, it still will be the fastest solution.

Iterate through every other point using the distance formula to find the minimum distance from Q (xq,yq).

However, you haven't given enough information for a performance-critical answer.

For example, if Q is a VERY common point, you might want to calculate the distance to Q and store it with each point.

Second example, if you have a huge number of points, you might organize the points into sections and start with points only in the same section and adjacent sections to the section containing Q.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow