I am trying to implement a kd-tree in Haskell (see the implementation) and tried to be clever by exploiting Haskell's laziness in the nearest-neighbour algorithm (see line 46).
While it is technically correct, that is:
minimumBy (compare `on` qd q) vs == (head . nearestNeighbours (kdtree 5 vs) $ q)
==> True
the kd-tree-based version is much slower (criterion: 5.38 ms vs 138.44 ms, with 500k data points). I first thought this was due to overly strict pattern matching in ordMerge (line 59), but I rewrote it, and as far as I understand, bs should now only be evaluated as needed.
If that is the case, the algorithm should only descend to the matching bucket and then re-ascend, checking along the way that the current best nearest neighbour really is the best candidate.
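To illustrate what I mean by "evaluated as needed", here is a simplified sketch of the lazy merge idea (illustrative names and types, not the actual code from my repo):

```haskell
-- Candidates are (squared distance, point) pairs; each list is
-- assumed to be sorted by ascending distance.
ordMerge :: Ord d => [(d, a)] -> [(d, a)] -> [(d, a)]
ordMerge []                 bs = bs
ordMerge as                 [] = as
ordMerge (a@(da, _) : as) (b@(db, _) : bs)
  | da <= db  = a : ordMerge as (b : bs)
  | otherwise = b : ordMerge (a : as) bs

-- The kd-tree trick: while the near branch's head is within `bound`
-- (e.g. the squared distance to the splitting plane), the far branch
-- `bs` is never forced, so unvisited subtrees stay unevaluated.
mergeBound :: Ord d => d -> [(d, a)] -> [(d, a)] -> [(d, a)]
mergeBound bound (a@(da, _) : as) bs
  | da <= bound = a : mergeBound bound as bs
mergeBound _ as bs = ordMerge as bs
```

For example, `take 1 (mergeBound 2 [(1, p), (3, q)] undefined)` succeeds without ever touching the far list.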
I did some profiling: nearestNeighbours is called about 800 times. Given a tree depth of 8 and 100 test cases, that sounds reasonable, doesn't it?
I just uploaded my code to GitHub: https://github.com/fhaust/threesg
This should get you started:
git clone https://github.com/fhaust/threesg
cd threesg
cabal sandbox init
cabal install --enable-benchmarks --enable-tests
cabal test
cabal bench --benchmark-options="+RTS -K100M -RTS"
(-K100M
is needed because the test set is created from 500k points.)
While creating a test set for GitHub I noticed that on normally distributed points the kd-tree search runs a lot faster than the linear search ... so my problem is probably not the algorithm but my test set :(
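For what it's worth, a normally distributed test set can be generated with a Box-Muller transform along these lines (a sketch of the idea, not the generator actually used in the repo):

```haskell
import System.Random (StdGen, mkStdGen, randomRs)

-- Box-Muller: turn two uniform (0,1] samples into one standard-normal sample.
boxMuller :: Double -> Double -> Double
boxMuller u1 u2 = sqrt (-2 * log u1) * cos (2 * pi * u2)

-- An infinite stream of standard-normal samples from a generator.
normals :: StdGen -> [Double]
normals g = go (randomRs (1e-12, 1) g)   -- lower bound avoids log 0
  where
    go (u1 : u2 : us) = boxMuller u1 u2 : go us
    go _              = []

-- Chunk the stream into 3D points for the kd-tree benchmark.
points3 :: StdGen -> [(Double, Double, Double)]
points3 g = go (normals g)
  where
    go (x : y : z : rest) = (x, y, z) : go rest
    go _                  = []
```

Comparing a run on `points3` against one on uniformly distributed points would show whether the distribution alone explains the gap between the kd-tree and linear search.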