Vectorized approach: After diving into a vectorized implementation, this looked like a very interesting look-up problem, so if you are still interested in vectorized techniques to get the job done, here's an approach with `bsxfun` -
%// Scalars to replace their repeated usages in the code
[W,H,NC] = size(Kart1);
d3a = 1:NC;
%// Indices for Kart1 at the first level of the nested loops
BI = bsxfun(@plus,K1(:,2)+(K1(:,1)-1)*W,(d3a-1)*W*H);
%// Indices for Kart1 in 3x3 kernel around the indices obtained at first level
BIW3 = bsxfun(@plus,[-1 0 1]',[-W 0 W]); %//'
%// Linear indices for the minimum value of Kart1 in the 3x3 neighborhood
[~,MI3] = min(sqrt(sum(Kart1(bsxfun(@plus,...
BI,permute(BIW3(:),[3 2 1]))).^2,2)),[],3);
%// X-Y indices
[xShift1,yShift1] = ind2sub([3 3],MI3);
%// Get Kart1 values corresponding to the indices for the minimum values
KP1Coor = Kart1(bsxfun(@plus,...
K1(:,2)+xShift1-2 +(K1(:,1)+yShift1-2-1)*W,(d3a-1)*W*H));
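For context, here is a minimal usage sketch of the inputs the snippet above expects. The sizes and random data are assumptions for illustration only; the variable names match the code -

```matlab
%// Hypothetical setup: Kart1 is a W-by-H-by-NC array; K1 is an N-by-2 list
%// of [col row] coordinates (note K1(:,2) is the row index, K1(:,1) the
%// column index in the code above), kept away from the borders so the
%// 3x3 neighborhood stays inside the array
W = 100; H = 100; NC = 3; N = 50;
Kart1 = rand(W,H,NC);
K1 = [randi([2 H-1],N,1) randi([2 W-1],N,1)];

%// ... run the vectorized code from above here ...

%// KP1Coor then comes out as an N-by-NC array: for each of the N query
%// points, the Kart1 values across all NC channels at the 3x3 neighbor
%// with the minimum L2 norm
```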
Benchmarking
I was able to test it out on the GPU too, using `gpuArray` from the Parallel Computing Toolbox, and then ran some benchmarks with `W = 1000`, `H = 1000`, using `N` as the datasize and varying it over `[1000 2000 5000 10000 20000]`. The results seem crazy, though not unreliable, as approved benchmarking methods from Measure and Improve GPU Performance were used. Here is the benchmark plot for the original code and the CPU and GPU vectorized codes -
It then seemed appropriate to benchmark only the vectorized codes, and for even larger datasizes; that plot is shown next -
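For reference, the GPU runs only require moving the inputs onto the device; the `bsxfun`-based code then runs unchanged on `gpuArray` inputs. A minimal sketch of the pattern (the wrapper function name is a hypothetical stand-in for the vectorized code above) -

```matlab
%// Move the inputs onto the GPU (Parallel Computing Toolbox)
Kart1_gpu = gpuArray(Kart1);
K1_gpu = gpuArray(K1);

%// Time it with gputimeit, which handles device synchronization and is
%// the recommended way to get reliable GPU timings
f = @() myVectorizedLookup(Kart1_gpu, K1_gpu); %// hypothetical wrapper
t = gputimeit(f);

%// Bring the result back to the CPU when done
KP1Coor = gather(f());
```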
Conclusions: The problem is essentially a look-up problem, where `Kart1` is the data and `K1` is the array of indices to look up. The vectorized solution presented here is basically a brute-force approach, and the benchmark results clearly favor it for performance. But it would be interesting to see whether a non-brute-force approach, perhaps even loop-based, that leverages the look-up efficiently could perform better.