Question

Does anyone know the origin (paper, book, etc.) of this approximate nearest neighbor technique for octrees?:

http://www.cs.ucdavis.edu/~amenta/w11/nnLecture.pdf

I am having trouble implementing it from the provided pseudo code. I am also having trouble finding the original publication of the technique, which I am hoping has a little more detail.

Thanks for any help.


Solution

This is not an exact answer, but an approximate one (to use the terminology of the subject :) ).

It was too big to write as a comment, and I think it is good information to start with.

The lecture mentions that Voronoi diagrams don't extend to dimensions higher than three, and it implies that octrees do. In terms of terminology, that's wrong.

An octree is defined in R^3. Simply put, the 2D analogue of this data structure is the quadtree. These kinds of trees have 2^D children per node, where D is the dimension. This means:

 1. 2D: 2^D children per node, i.e. 4 children per node.

 2. 3D: 2^D children per node, i.e. 8 children per node.

 3. and so on.

The name octree comes from the Greek word octo, which means eight: this tree has 8 children per node.
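To make the growth concrete, here is a tiny sketch (the helper name is mine, not from any source) of why the number of children explodes with the dimension:

```python
# Hypothetical helper: children count of a 2^D-ary tree node.
def children_per_node(dimension):
    """A quadtree (D=2) has 4 children, an octree (D=3) has 8, and so on."""
    return 2 ** dimension

print(children_per_node(2))   # quadtree: 4
print(children_per_node(3))   # octree: 8
print(children_per_node(10))  # 1024 children per node -- why it doesn't scale
```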

I have implemented this kind of tree for NN (Nearest Neighbor) and, even though I made the tree polymorphic so as not to waste any memory, it wouldn't scale above 10 dimensions.

Moreover, the lecture mentions kd-trees. Notice that when the dimension grows, the query time is no longer O(log n); it becomes only slightly better than the brute-force approach (i.e. checking all points). The higher the dimension, the worse kd-trees perform.

A kd-tree is actually a binary tree, embedded in geometry. By that I mean that every node has two children, and at every level we halve the dataset (usually at the median of the coordinate with the greatest variance, so that we can exploit the structure of the dataset). This results in a balanced tree.
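A minimal sketch of such a build, assuming dict-based nodes and a 4-points-per-leaf rule (`build_kdtree` and `LEAF_SIZE` are my names, not from the lecture):

```python
import statistics

LEAF_SIZE = 4  # store up to 4 points per leaf

def build_kdtree(points):
    """Split at the median of the coordinate with the greatest variance."""
    if len(points) <= LEAF_SIZE:
        return {"leaf": True, "points": points}
    dims = range(len(points[0]))
    # pick the splitting dimension: the coordinate with the greatest variance
    split_dim = max(dims, key=lambda d: statistics.pvariance(p[d] for p in points))
    pts = sorted(points, key=lambda p: p[split_dim])
    mid = len(pts) // 2  # halve the dataset at the median
    return {"leaf": False,
            "dim": split_dim,
            "value": pts[mid][split_dim],
            "left": build_kdtree(pts[:mid]),
            "right": build_kdtree(pts[mid:])}
```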

Here you can see a kd-tree, made by a friend of mine, of 64 points in 8D. In this version, we store 4 points per leaf.

The numbers in the boxes refer to the point number (starting with 1, i.e. line numbers in the test.points file).

The notation "8 @ 0.532" refers to an inner node, where the data is split at 0.532 along the eighth dimension (again, dimensions starting with 1, for easier human understanding).

That's why we turn our interest to approximate NN, where we accept some loss in accuracy in exchange for a speedup. (As you may know, everything is a trade-off.)

By "box", the lecture probably means a minimum bounding box.

This is simple and here is an example:

Suppose you have, in 2D, this dataset:

-1 -2
 0  5
 8 -5

To construct the bounding box, we need to find the minimum and the maximum coordinate in every dimension. Note that to store the bounding box, it is enough to store its min and max corners.

Here, we have min = (-1, -5) and max = (8, 5). The bounding box is then the rectangle whose corners, in clockwise order starting from the max corner, are:

( 8,  5)  // ( max.x, max.y)
( 8, -5)  // ( max.x, min.y)
(-1, -5)  // ( min.x, min.y)
(-1,  5)  // ( min.x, max.y)

Observe that all the points of the dataset lie inside this bounding box.
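The construction above can be sketched like this (`bounding_box` is a hypothetical helper name):

```python
def bounding_box(points):
    """Return (min_corner, max_corner) of the axis-aligned bounding box."""
    dims = range(len(points[0]))
    mins = tuple(min(p[d] for p in points) for d in dims)
    maxs = tuple(max(p[d] for p in points) for d in dims)
    return mins, maxs

pts = [(-1, -2), (0, 5), (8, -5)]
print(bounding_box(pts))  # ((-1, -5), (8, 5))
```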

As for the paper, it's actually a lecture, not a paper. It doesn't explain how one should implement the algorithm, and it doesn't provide enough unique information to track down another PDF that explains the one in your link in more detail.

[EDIT] for the OP's comment.

1) Q: dequeue box B, containing representative point p

I would say that dequeue means extracting the "first" element of the queue, and enqueue means pushing an element onto the back of the queue. The queue seems to hold bounding boxes as elements.

2) Q: r = d(q,B)

Maybe it means the distance from the query point to the representative point the box contains. It's not clear.

You can compute the (Euclidean) distance from the query point to the closest point of the box (not necessarily a corner), or to the representative point of the box.
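The usual way to get the distance to the closest point of an axis-aligned box is to clamp the query point into the box and measure the distance to the clamped point; zero means the query lies inside. A sketch (`dist_point_to_box` is my name, not the lecture's):

```python
import math

def dist_point_to_box(q, box_min, box_max):
    """Euclidean distance from q to the closest point of an axis-aligned box."""
    # clamp each coordinate of q into [box_min, box_max]
    closest = [min(max(qi, lo), hi) for qi, lo, hi in zip(q, box_min, box_max)]
    return math.dist(q, closest)

print(dist_point_to_box((10, 0), (-1, -5), (8, 5)))  # 2.0 (outside, right of the box)
print(dist_point_to_box((0, 0), (-1, -5), (8, 5)))   # 0.0 (inside the box)
```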

3) for all children B' of B containing points in P

P is the dataset. Every box is partitioned into 8 sub-boxes at every level (in the case of an octree).

4) Q: while dN >= (1+e)r do

The approximation error e is what we usually call epsilon. It is typically a parameter, and it means that instead of the strict check:

while delta >= r do

you are less strict and instead do

while delta >= (1 + e)*r do

which means that you enter the loop fewer times than with the exact condition above. So, I think it says to insert every sub-box of box B into the queue. This is not so clever, IMHO.
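For what it's worth, here is how this kind of best-first loop is commonly implemented: boxes are popped from a priority queue in order of their distance to the query, and the search stops once dN < (1+e)*r, i.e. once no remaining box can improve the current answer by more than the (1+e) factor. Everything here (the minimal Box class, the two-way split instead of an octree's 8-way split, the exact placement of the strictness in the comparison) is my own sketch, not the lecture's pseudocode:

```python
import heapq
import math

class Box:
    """Minimal stand-in for the lecture's boxes; a real octree would
    split each box into 2^D sub-boxes instead of two halves."""
    def __init__(self, points):
        self.rep = points[0]  # representative point of the box
        dims = range(len(points[0]))
        self.lo = tuple(min(p[d] for p in points) for d in dims)
        self.hi = tuple(max(p[d] for p in points) for d in dims)
        mid = len(points) // 2
        self.children = [] if len(points) == 1 else \
            [Box(points[:mid]), Box(points[mid:])]

    def dist(self, q):
        """Distance from q to the closest point of the box (0 if inside)."""
        closest = [min(max(qi, l), h) for qi, l, h in zip(q, self.lo, self.hi)]
        return math.dist(q, closest)

def approx_nn(q, root, eps):
    best, best_d = None, math.inf            # best_d plays the role of dN
    heap = [(root.dist(q), id(root), root)]  # priority = distance to box
    while heap:
        d_box, _, box = heapq.heappop(heap)
        if (1 + eps) * d_box >= best_d:      # "dN >= (1+e)r" fails: stop
            break
        d_rep = math.dist(q, box.rep)        # try the representative point
        if d_rep < best_d:
            best, best_d = box.rep, d_rep
        for child in box.children:           # enqueue the sub-boxes of B
            heapq.heappush(heap, (child.dist(q), id(child), child))
    return best
```

With eps = 0 this degenerates to an exact best-first NN search; increasing eps makes the stopping test fire earlier, which is exactly the speed-for-accuracy trade-off discussed above.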

About the last comment with e = 0.01: just do the math in the condition above. You will see that the answer is no since, as the link you posted states, e is a multiplicative factor.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow