Java: distance metric algorithm design

https://stackoverflow.com/questions/10980741

13-06-2021
|

Question

i am trying to work out the following problem in Java (although it could be done in pretty much any other language):

I am given two arrays of integer values, xs and ys, representing dataPoints on the x-axis. Their length might not be identical, though both are > 0, and they need not be sorted. What I want to calculate is the minimum distance measure between two data set of points. What I mean by that is, for each x I find the closest y in the set of ys and calculate distance, for instance (x-y)^2. For instance:

xs = [1,5]
ys = [10,4,2]

should return (1-2)^2 + (5-4)^2 + (5-10)^2

Distance measure is not important, it's the algorithm I am interested in. I was thinking about sorting both arrays and advance indexes in both of these arrays somehow to achieve something better than bruteforce (for each elem in x, scan all elems in ys to find min) which is O(len1 * len2).

This is my own problem I am working on, not a homework question. All your hints would be greatly appreciated.

Solution

I assume that HighPerformanceMark (first comment on your question) is right and you actually take the larger array, find for each element the closest one of the smaller array and sum up some f(dist) over those distances.

I would suggest your approach:

Sort both arrays 
indexSmall=0 

// sum up
for all elements e in bigArray {
  // increase index as long as we get "closer"
  while (dist(e,smallArray(indexSmall)) > dist(e,smallArray(indexSmall+1)) {
    indexSmall++
  }
  sum += f(dist(e,smallArray(indexSmall)));
}

which is O(max(len1,len2)*log(max(len1,len2))) for the sorting. The rest is linear to the larger array length. Now dist(x,y) would be something like abs(x-y), and f(d)=d^2 or whatever you want.

OTHER TIPS

You're proposed idea sounds good to me. You can sort the lists in O(n logn) time. Then you can do a single iteration over the longer list using a sliding index on the other to find the "pairs". As you progress through the longer list, you will never have to backtrack on the other. So now your whole algorithm is O(n logn + n) = O(n logn).

Your approach is pretty good and has O(n1*log(n1)+n2*log(n2)) time complexity.

If the arrays has different lengths, another approach is to:

sort the shorter array;
traverse the longer array from start to finish, using binary search to locate the nearest item in the sorted short array.

This has O((n1+n2)*log(n1)) time complexity, where n1 is the length of the shorter array.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow