Why are KD-trees so damn slow for nearest neighbor search in point sets?

Question 1

From my experience, implementation quality varies widely, unfortunately. I have, however, never looked at the CGAL implementation.

The worst case for the k-d-tree usually is when due to incremental changes it becomes too unbalanced, and should be reloaded.

However, in general such trees are most efficient when you don't know the data distribution.

In your case it sounds as if a simple grid-based approach may be the best choice. If you want, you can consider a texture to be a dense 2d grid. So maybe you can find a 2d projection where a grid works good, and then intersect with this projection.

Question 2

Answers are not the place to ask questions, but your question is not a question, but a statement that the kd-tree of CGAL sucks.

Reading 1.8mio points of a geological data model, and computing the 50 clostest points for each of these points has the following performance on my Dell Precision, Windows7, 64bit, VC10:

reading the points from a file: 10 sec
Construction of the tree 1.3 sec
1.8 mio queries reporting the 50 closest points: 17 sec

Do you have similar performances. Would you expect a kd-tree to be faster?

Also I am wondering where your query points are, that is close to the surface, or close to the skeleton.

Question 3

I have done some research into fast KD-tree implementations a few months ago, and I agree with Anony-Mousse that quality (and "weight" of libraries) varies strongly. Here are some of my findings:

kdtree2 is a little known and pretty straightforward KD-tree implementation I found to be quite fast for 3D problems, especially if you allow it to copy and re-sort your data. Also, it is small and very easy to incorporate and adapt. Here is a paper on the implementation by the author (don't be put off by the mentioning of Fortran in the title). This is the library I ended up using. My colleagues benchmarked its speed for 3D points against VLFeat's KD-trees and another library I don't remember (many FLANN, see below) and it won.

FLANN has a reputation of being fast, and is used and recommended quite often recently. It aims at the higher dimensional case, where it offers approximate algorithms, but is also used in the Point Cloud Library which deals with 3D problems.

I did not experiment with CGAL since I found the library to be too heavyweight. I agree that CGAL has a good reputation, but I feel it occasionally suffers from oversophistication.

Question 4

Take a look at ApproxMVBB library under the MPL licence:

https://github.com/gabyx/ApproxMVBB:

The kdTree implementation should be comparable to PCL(FLANN) and might be even faster. (tests with PCL seemed to be faster with my implementation!)

Diclaimer: I am the owner of this library and by no means this library claims to be any faster and serious performance tests have not been conducted yet, but I am using this library sucessfully in granular rigid body dynamics where speed is king! However, this library is very small, and the kdTree implementation is very generic (see the examples) and lets you have custom splitting heurstics and other fancy stuff :-).

Similar improvements and considerations as in the nanoflann (direct data access etc., generic data, n-dimensional ) are implemented ... (see the KdTree.hpp) header.

Some Updates on Timing:
The example kdTreeFilteringcontains some small benchmarks: The standord bunny with 35947 points is loaded (fully working example in the repo out of the box) :

The results:

Bunny.txt

Loaded: 35947 points 
KDTree::  Exotic point traits , Vector3* +  id, start: =====
KdTree build took: 3.1685ms.
Tree Stats: 
     nodes      : 1199
     leafs      : 600
     tree level : 11
     avg. leaf data size : 29.9808
     min. leaf data size : 0
     max. leaf data size : 261
     min. leaf extent    : 0.00964587
     max. leaf extent    : 0.060337
SplitHeuristics Stats: 
     splits            : 599
     avg. split ratio  (0,0.5] : 0.5
     avg. point ratio  [0,0.5] : 0.22947
     avg. extent ratio (0,1]   : 0.616848
     tries / calls     : 599/716 = 0.836592
Neighbour Stats (if computed): 

     min. leaf neighbours    : 6
     max. leaf neighbours    : 69
     avg. leaf neighbours    : 18.7867
(Built with methods: midpoint, no split heuristic optimization loop)


Saving KdTree XML to: KdTreeResults.xml
KDTree:: Simple point traits , Vector3 only , start: =====
KdTree build took: 18.3371ms.
Tree Stats: 
     nodes      : 1199
     leafs      : 600
     tree level : 10
     avg. leaf data size : 29.9808
     min. leaf data size : 0
     max. leaf data size : 306
     min. leaf extent    : 0.01
     max. leaf extent    : 0.076794
SplitHeuristics Stats: 
     splits            : 599
     avg. split ratio  (0,0.5] : 0.448302
     avg. point ratio  [0,0.5] : 0.268614
     avg. extent ratio (0,1]   : 0.502048
     tries / calls     : 3312/816 = 4.05882
Neighbour Stats (if computed): 

     min. leaf neighbours    : 6
     max. leaf neighbours    : 43
     avg. leaf neighbours    : 21.11
(Built with methods: midpoint, median,geometric mean, full split heuristic optimization)

Lucy.txt model with 14 million points:

Loaded: 14027872 points 
KDTree::  Exotic point traits , Vector3* +  id, start: =====
KdTree build took: 3123.85ms.
Tree Stats: 
         nodes      : 999999
         leafs      : 500000
         tree level : 25
         avg. leaf data size : 14.0279
         min. leaf data size : 0
         max. leaf data size : 159
         min. leaf extent    : 2.08504
         max. leaf extent    : 399.26
SplitHeuristics Stats: 
         splits            : 499999
         avg. split ratio  (0,0.5] : 0.5
         avg. point ratio  [0,0.5] : 0.194764
         avg. extent ratio (0,1]   : 0.649163
         tries / calls     : 499999/636416 = 0.785648
(Built with methods: midpoint, no split heuristic optimization loop)

KDTree:: Simple point traits , Vector3 only , start: =====
KdTree build took: 7766.79ms.
Tree Stats: 
     nodes      : 1199
     leafs      : 600
     tree level : 10
     avg. leaf data size : 11699.6
     min. leaf data size : 0
     max. leaf data size : 35534
     min. leaf extent    : 9.87306
     max. leaf extent    : 413.195
SplitHeuristics Stats: 
     splits            : 599
     avg. split ratio  (0,0.5] : 0.297657
     avg. point ratio  [0,0.5] : 0.492414
     avg. extent ratio (0,1]   : 0.312965
     tries / calls     : 5391/600 = 8.985
Neighbour Stats (if computed): 

     min. leaf neighbours    : 4
     max. leaf neighbours    : 37
     avg. leaf neighbours    : 12.9233
(Built with methods: midpoint, median,geometric mean, full split heuristic optimization)

Take care about the interpretation! and look at the settings used in the example File. However comparing with results from other people: ~3100ms for 14*10⁶ points is quite slick :-)

Processor used: Intel® Core™ i7 CPU 970 @ 3.20GHz × 12 , 12GB Ram

Question 5

If the kdtree is fast for small sets, but "slow" for large (>100000?) sets, you may be suffering from flushing the processor caches. If the top few nodes are interleaved with rarely used leaf nodes then you will fit fewer heavily used nodes in the processor cache(s). This can be improved by minimising the size of the nodes and careful layout of the nodes in memory.. but eventually you will be flushing a fair number of nodes anyway. You can end up with access to main memory being the bottle-neck.

Valgrind tells me one version of my code is 5% fewer instructions, but I believe the stopwatch when it tells me it is about 10% slower for the same input. I suspect valgrind doing a full cache simulation would tell me the right answer.

If you are multi-threaded, you may want to encourage the threads to be doing searches in similar areas to reuse the cache... that assumes a single multi-core processor - multiple processors might want the opposite approach.

Hint: You pack more 32bit indexes in memory than 64bit pointers.