Question

I would like to find the item in DF2 that is closest to each item in DF1.

The distance is Euclidean distance.

For example, for A in DF1, F in DF2 is the closest one.

>>> DF1
   X  Y name
0  1  2    A
1  3  4    B
2  5  6    C
3  7  8    D
>>> DF2
   X  Y name
0  3  8    E
1  2  4    F
2  1  9    G
3  6  4    H

My code is

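import pandas as pd

DF1 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'X': [1, 3, 5, 7], 'Y': [2, 4, 6, 8]})
DF2 = pd.DataFrame({'name': ['E', 'F', 'G', 'H'], 'X': [3, 2, 1, 6], 'Y': [8, 4, 9, 4]})


def ndis(row):
    # Squared Euclidean distance from this DF1 row to every row of DF2
    X, Y = row['X'], row['Y']
    DF2['DIS'] = (DF2.X - X) * (DF2.X - X) + (DF2.Y - Y) * (DF2.Y - Y)
    # .loc replaces the removed .ix indexer; fetch the name by label,
    # since the positional temp[2] depends on column order
    temp = DF2.loc[DF2.DIS.idxmin()]
    return temp['name']


DF1['Z'] = DF1.apply(ndis, axis=1)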

This works fine, but it takes too long on a large data set.

Another question is how to find the 2nd and 3rd closest ones.


Solution

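There is more than one approach. For example, numpy can compute the squared distances from one row of DF1 to every row of DF2 in a single vectorized step. Squared distances are enough here, because they rank the rows in the same order as the true Euclidean distances. For A (row 0 of DF1):

>>> import numpy
>>> xy = ['X', 'Y']
>>> distance_array = numpy.sum((DF1[xy].values[0] - DF2[xy].values)**2, axis=1)
>>> distance_array.argmin()
1

Position 1 in DF2 is F. Top 3 closest (a full sort is probably not the fastest approach, but it is the simplest):

>>> distance_array.argsort()[:3]
array([1, 3, 0])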

If speed is a concern, run performance tests.
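To handle every row of DF1 at once rather than one point at a time, the same idea can be extended with broadcasting. The following is a minimal sketch, not a tuned implementation; the column names Z2 and Z3 and the variable names are illustrative choices, not part of the original question.

```python
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'X': [1, 3, 5, 7], 'Y': [2, 4, 6, 8]})
DF2 = pd.DataFrame({'name': ['E', 'F', 'G', 'H'], 'X': [3, 2, 1, 6], 'Y': [8, 4, 9, 4]})

xy = ['X', 'Y']
a = DF1[xy].to_numpy()  # shape (len(DF1), 2)
b = DF2[xy].to_numpy()  # shape (len(DF2), 2)

# Broadcasting (n, 1, 2) against (1, m, 2) gives every pairwise difference;
# summing the squares over the last axis yields an (n, m) matrix of
# squared Euclidean distances.
sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

# Sort each row by distance; the first k columns are the positions of the
# k nearest DF2 rows for that DF1 row.
k = 3
nearest = np.argsort(sq_dist, axis=1, kind='stable')[:, :k]

names = DF2['name'].to_numpy()
DF1['Z'] = names[nearest[:, 0]]   # closest
DF1['Z2'] = names[nearest[:, 1]]  # second closest (hypothetical column name)
DF1['Z3'] = names[nearest[:, 2]]  # third closest (hypothetical column name)
```

The distance matrix has len(DF1) x len(DF2) entries, so this trades memory for speed. For very large inputs it may be necessary to process DF1 in chunks or switch to a tree-based lookup.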

OTHER TIPS

Look at scipy.spatial.KDTree and the related cKDTree, which is faster but offers only a subset of the functionality. For large sets, you probably won't beat that for speed.
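As a rough sketch of the tree-based route, the snippet below builds a cKDTree on DF2 and queries it with every point of DF1, asking for the three nearest neighbours. The frames are the ones from the question; the variable names and the Z2 and Z3 columns are illustrative assumptions.

```python
import pandas as pd
from scipy.spatial import cKDTree

DF1 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'X': [1, 3, 5, 7], 'Y': [2, 4, 6, 8]})
DF2 = pd.DataFrame({'name': ['E', 'F', 'G', 'H'], 'X': [3, 2, 1, 6], 'Y': [8, 4, 9, 4]})

xy = ['X', 'Y']

# Build the tree once on the points that will be searched.
tree = cKDTree(DF2[xy].to_numpy())

# query returns Euclidean distances (the default p=2) and row positions in
# DF2, both of shape (len(DF1), k), with the nearest neighbour first.
dist, idx = tree.query(DF1[xy].to_numpy(), k=3)

names = DF2['name'].to_numpy()
DF1['Z'] = names[idx[:, 0]]   # closest
DF1['Z2'] = names[idx[:, 1]]  # second closest (hypothetical column name)
DF1['Z3'] = names[idx[:, 2]]  # third closest (hypothetical column name)
```

Unlike the squared distances used above, dist holds true Euclidean distances. Recent SciPy documentation describes KDTree and cKDTree as functionally identical, so either name should work on a current installation.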

Licensed under: CC-BY-SA with attribution