How to find k nearest neighbors to the median of n distinct numbers in O(n) time?

https://stackoverflow.com/questions/1557678

21-09-2019
|

Question

I can use the median of medians selection algorithm to find the median in O(n). Also, I know that after the algorithm is done, all the elements to the left of the median are less that the median and all the elements to the right are greater than the median. But how do I find the k nearest neighbors to the median in O(n) time?

If the median is n, the numbers to the left are less than n and the numbers to the right are greater than n. However, the array is not sorted in the left or the right sides. The numbers are any set of distinct numbers given by the user.

The problem is from Introduction to Algorithms by Cormen, problem 9.3-7

Solution 8

Actually, the answer is pretty simple. All we need to do is to select k elements with the smallest absolute differences from the median moving from m-1 to 0 and m+1 to n-1 when the median is at index m. We select the elements using the same idea we use in merging 2 sorted arrays.

OTHER TIPS

No one seems to quite have this. Here's how to do it. First, find the median as described above. This is O(n). Now park the median at the end of the array, and subtract the median from every other element. Now find element k of the array (not including the last element), using the quick select algorithm again. This not only finds element k (in order), it also leaves the array so that the lowest k numbers are at the beginning of the array. These are the k closest to the median, once you add the median back in.

The median-of-medians probably doesn't help much in finding the nearest neighbours, at least for large n. True, you have each column of 5 partitioned around it's median, but this isn't enough ordering information to solve the problem.

I'd just treat the median as an intermediate result, and treat the nearest neighbours as a priority queue problem...

Once you have the median from the median-of-medians, keep a note of it's value.

Run the heapify algorithm on all your data - see Wikipedia - Binary Heap. In comparisons, base the result on the difference relative to that saved median value. The highest priority items are those with the lowest ABS(value - median). This takes O(n).

The first item in the array is now the median (or a duplicate of it), and the array has heap structure. Use the heap extract algorithm to pull out as many nearest-neighbours as you need. This is O(k log n) for k nearest neighbours.

So long as k is a constant, you get O(n) median of medians, O(n) heapify and O(log n) extracting, giving O(n) overall.

med=Select(A,1,n,n/2)   //finds the median

for i=1 to n
   B[i]=mod(A[i]-med)

q=Select(B,1,n,k) //get the kth smallest difference

j=0
for i=1 to n
   if B[i]<=q 
     C[j]=A[i] //A[i], the real value should be assigned instead of B[i] which is only the difference between A[i] and median.
       j++
return C

You can solve your problem like that:

You can find the median in O(n), w.g. using the O(n) nth_element algorithm.

You loop through all elements substutiting each with a pair:

the absolute difference to the median, element's value.

Once more you do nth_element with n = k. after applying this algorithm you are guaranteed to have the k smallest elements in absolute difference first in the new array. You take their indices and DONE!

Four Steps:

First find the median (Median of median) - O(n)
Determine the absolute difference between median and each of the elements - O(n)
Use the kth smallest element algorithm to get the result(Quickselect) - O(n)
Now we need to pick the k closest out of the array - O(n)

You could use a non-comparison sort, such as a radix sort, on the list of numbers L, then find the k closest neighbors by considering windows of k elements and examining the window endpoints. Another way of stating "find the window" is find i that minimizes abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i] - L[n/2]) (if k is odd) or abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i+1] - L[n/2]) (if k is even). Combining the cases, abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i+!(k&1)] - L[n/2]). A simple, O(k) way of finding the minimum is to start with i=0, then slide to the left or right, but you should be able to find the minimum in O(log(k)).

The expression you minimize comes from transforming L into another list, M, by taking the difference of each element from the median.

m=L[n/2]
M=abs(L-m)

i minimizes M[n/2-k/2+i] + M[n/2+k/2+i].

You already know how to find the median in O(n)

if the order does not matter, selection of k smallest can be done in O(n) apply for k smallest to the rhs of the median and k largest to the lhs of the median

from wikipedia

 function findFirstK(list, left, right, k)
 if right > left
     select pivotIndex between left and right
     pivotNewIndex := partition(list, left, right, pivotIndex)
     if pivotNewIndex > k  // new condition
         findFirstK(list, left, pivotNewIndex-1, k)
     if pivotNewIndex < k
         findFirstK(list, pivotNewIndex+1, right, k)

don't forget the special case where k==n return the original list

Find the median in O(n). 2. create a new array, each element is the absolute value of the original value subtract the median 3. Find the kth smallest number in O(n) 4. The desired values are the elements whose absolute difference with the median is less than or equal to the kth smallest number in the new array.

If you know the index of the median, which should just be ceil(array.length/2) maybe, then it just should be a process of listing out n(x-k), n(x-k+1), ... , n(x), n(x+1), n(x+2), ... n(x+k) where n is the array, x is the index of the median, and k is the number of neighbours you need.(maybe k/2, if you want total k, not k each side)

First select the median in O(n) time, using a standard algorithm of that complexity. Then run through the list again, selecting the elements that are nearest to the median (by storing the best known candidates and comparing new values against these candidates, just like one would search for a maximum element).

In each step of this additional run through the list O(k) steps are needed, and since k is constant this is O(1). So the total for time needed for the additional run is O(n), as is the total runtime of the full algorithm.

Since all the elements are distinct, there can be atmost 2 elements with the same difference from the mean. I think it is easier for me to have 2 arrays A[k] and B[k] the index representing the absolute value of the difference from the mean. Now the task is to just fill up the arrays and choose k elements by reading the first k non empty values of the arrays reading A[i] and B[i] before A[i+1] and B[i+1]. This can be done in O(n) time.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow