Question

I am experimenting with the KDTree query function in scipy.spatial. I run into an issue once my data sets get really large. I realize that the algorithm is not necessarily designed to be efficient for large data sets, but it appears (from the source) that size should only increase processing time, not affect the output.

Here is a code snippet:

import numpy as np
import pysal as ps  # 'ps' is presumably PySAL; ps.common.KDTree wraps scipy.spatial's KDTree

sizes = [10**i for i in range(5, 6)]  # 10^5 for this test
# Note: np.random.random_integers is deprecated; np.random.randint(0, 101, ...) is the modern equivalent
data = np.random.random_integers(0, 100, (sizes[-1], 2))
for size in sizes:
    kd = ps.common.KDTree(data)
    nnq = kd.query(data, k=2 + 1, p=2)
    info = nnq[1]  # indices of the neighbors
    neighbors = {}
    idset = np.arange(len(info))  # indices of the input points
    for i, row in enumerate(info):
        row = row.tolist()
        row.remove(i)
        neighbors[idset[i]] = list(row)

This raises a ValueError when i is not in the list (ValueError: list.remove(x): x not in list). For data sizes smaller than 10^5 the code works as expected.

One potential reason for the error is that the recursion limit is being reached. To rule this out, I raised the recursion limit to 1,000,000 with sys.setrecursionlimit(1000000). This does not alleviate the problem.


Solution

The error occurs in your own code, at the statement row.remove(i), not inside SciPy. The problem is that your random data set can contain duplicate points: the coordinates are drawn from a 101 × 101 grid of integers, so 10^5 samples repeat each point roughly ten times on average. When a point's coordinates occur more than three times, a k=3 query can return three exact duplicates of the query point without including the query point's own index. In that case i is not in row, and row.remove(i) raises the ValueError.
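You can reproduce the behavior directly with scipy.spatial, independent of the rest of your pipeline. The snippet below is a minimal sketch: it queries a data set of four identical points with k=3, and the returned index rows are not guaranteed to contain the query point's own index, because ties among duplicates at distance zero are broken arbitrarily.

import numpy as np
from scipy.spatial import KDTree

pts = np.zeros((4, 2))      # four exact duplicates of the same point
kd = KDTree(pts)
dist, idx = kd.query(pts, k=3)
print(idx)                  # row i need not contain i: all four points tie
                            # at distance 0, but only three slots are returned

Given that, one defensive rewrite of your neighbor loop (assuming info is the index array returned by the original k=2+1 query) is to check membership before removing:

neighbors = {}
for i, row in enumerate(info):
    row = row.tolist()
    if i in row:
        row.remove(i)   # usual case: drop the query point itself
    else:
        row = row[:-1]  # i was crowded out by exact duplicates; drop the last hit instead
    neighbors[i] = row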

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow