Vectorized approach: After diving into a vectorized implementation, this looked like a very interesting look-up problem, so if you are still interested in vectorized techniques to get the job done, here's an approach with `bsxfun` -
%// Scalars to replace their repeated usages in the code
[W,H,NC] = size(Kart1);
d3a = 1:NC;
%// Indices for Kart1 at the first level of the nested loops
BI = bsxfun(@plus,K1(:,2)+(K1(:,1)-1)*W,(d3a-1)*W*H);
%// Indices for Kart1 in 3x3 kernel around the indices obtained at first level
BIW3 = bsxfun(@plus,[-1 0 1]',[-W 0 W]); %//'
%// Linear indices for the minimum value of Kart1 in the 3x3 neighborhood
[~,MI3] = min(sqrt(sum(Kart1(bsxfun(@plus,...
BI,permute(BIW3(:),[3 2 1]))).^2,2)),[],3);
%// X-Y indices
[xShift1,yShift1] = ind2sub([3 3],MI3);
%// Get Kart1 values corresponding to the indices for the minimum values
KP1Coor = Kart1(bsxfun(@plus,...
K1(:,2)+xShift1-2 +(K1(:,1)+yShift1-2-1)*W,(d3a-1)*W*H));
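For context, here is a minimal usage sketch of the inputs the snippet above expects. The sizes and random data are assumptions for illustration only; the variable names match the code -

```matlab
%// Hypothetical setup: Kart1 is a W-by-H-by-NC array; K1 is an N-by-2 list
%// of [col row] coordinates (note K1(:,2) is the row index, K1(:,1) the
%// column index in the code above), kept away from the borders so the
%// 3x3 neighborhood stays inside the array
W = 100; H = 100; NC = 3; N = 50;
Kart1 = rand(W,H,NC);
K1 = [randi([2 H-1],N,1) randi([2 W-1],N,1)];

%// ... run the vectorized code from above here ...

%// KP1Coor then comes out as an N-by-NC array: for each of the N query
%// points, the Kart1 values across all NC channels at the 3x3 neighbor
%// with the minimum L2 norm
```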
Benchmarking
I was able to test it out on the GPU too, using `gpuArray` from the Parallel Computing Toolbox, and then ran some benchmarks with `W = 1000`, `H = 1000`, using `N` as the datasize and varying it over `[1000 2000 5000 10000 20000]`. The results seem crazy, though not unreliable, as approved benchmarking methods from Measure and Improve GPU Performance were used. Here is the benchmark plot for the original code and the CPU and GPU vectorized codes -
It then seemed appropriate to benchmark only the vectorized codes, and for even larger datasizes; that plot is shown next -
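For reference, the GPU runs only require moving the inputs onto the device; the `bsxfun`-based code then runs unchanged on `gpuArray` inputs. A minimal sketch of the pattern (the wrapper function name is a hypothetical stand-in for the vectorized code above) -

```matlab
%// Move the inputs onto the GPU (Parallel Computing Toolbox)
Kart1_gpu = gpuArray(Kart1);
K1_gpu = gpuArray(K1);

%// Time it with gputimeit, which handles device synchronization and is
%// the recommended way to get reliable GPU timings
f = @() myVectorizedLookup(Kart1_gpu, K1_gpu); %// hypothetical wrapper
t = gputimeit(f);

%// Bring the result back to the CPU when done
KP1Coor = gather(f());
```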
Conclusions: The problem is essentially a look-up problem, where `Kart1` is the data and `K1` is the array of indices to look up. The vectorized solution presented here is basically a brute-force approach, and the benchmark results clearly favor it for performance. But it would be interesting to see whether a non-brute-force approach, perhaps even loop-based, that leverages the look-up efficiently could perform better.