I can't really help you with the strides approach, but I do have a method that should be faster than your original code. It loops over the tool array instead of over the base array, so although it is not fully vectorized, a lot more of the work is pushed to numpy.
Note that I changed the ranges in your original code and switched the widths and heights, because I assume that is what you intended.

    import numpy as np

    height, width = 500, 500
    toolh, toolw = 6, 6
    base = np.random.rand(height, width)
    tool = np.random.rand(toolh, toolw)
    # Number of valid tool positions along each axis.
    m, n = height-toolh+1, width-toolw+1

    def height_diff_old(base, tool):
        # One Python iteration per tool position (m*n iterations).
        zdiff = np.empty((m, n))
        for i in range(m):
            for j in range(n):
                zdiff[i, j] = (tool - base[i:i+toolh, j:j+toolw]).min()
        return zdiff

    def height_diff_new(base, tool):
        # One Python iteration per tool cell (toolh*toolw iterations);
        # each iteration handles all m*n positions at once in numpy.
        zdiff = np.empty((m, n))
        zdiff.fill(np.inf)
        for i in range(toolh):
            for j in range(toolw):
                diff_ij = tool[i, j] - base[i:i+m, j:j+n]
                np.minimum(zdiff, diff_ij, out=zdiff)
        return zdiff

Of course you'd want to calculate the heights and widths in your actual function, but for testing it was easier having them as globals.
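For what it's worth, here is a rough sketch of what that could look like with the sizes taken from the array shapes instead of globals. The name `height_diff`, the `ValueError` check and the small test arrays are my own assumptions for illustration, not something from your code. It also compares the result against the brute-force double loop on a small input.

```python
import numpy as np


def height_diff(base, tool):
    """Minimum of (tool - base window) for every valid tool position."""
    toolh, toolw = tool.shape
    height, width = base.shape
    # Number of valid tool positions along each axis.
    m, n = height - toolh + 1, width - toolw + 1
    if m <= 0 or n <= 0:
        # Assumption: treat a tool larger than the base as an error.
        raise ValueError("tool must not be larger than base")

    zdiff = np.full((m, n), np.inf)
    for i in range(toolh):
        for j in range(toolw):
            # Compare one tool cell against every base cell it can end up above.
            np.minimum(zdiff, tool[i, j] - base[i:i + m, j:j + n], out=zdiff)
    return zdiff


# Small, non-square test data so that swapped axes would show up.
rng = np.random.default_rng(0)
base = rng.random((40, 30))
tool = rng.random((6, 4))
result = height_diff(base, tool)

# Brute-force reference: one window per output element.
expected = np.array([
    [(tool - base[i:i + 6, j:j + 4]).min() for j in range(result.shape[1])]
    for i in range(result.shape[0])
])
print(result.shape, np.allclose(result, expected))
```

Using non-square arrays in the check is deliberate: with square inputs a mixed-up height and width would go unnoticed.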
For the given array sizes the original code runs in 7.38 seconds while the new code takes only 206 milliseconds on my system. I assume the new code is faster for your array sizes as well but I'm not sure how well it scales :)
Other alternatives that may or may not be of interest to you are Numba and Cython, which in many cases should be faster than any "vectorized" numpy code you can think of.
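To give an idea of the Numba route, here is a minimal, untimed sketch. It assumes Numba is installed; if it is not, the `try`/`except` falls back to plain Python so the code still runs (slowly). The function name `height_diff_loops` is made up for this example. With Numba the usual advice is to write the plain nested loops and let the compiler deal with them, so no numpy tricks are needed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, use a no-op decorator so the sketch still runs.
    def njit(func):
        return func


@njit
def height_diff_loops(base, tool):
    toolh, toolw = tool.shape
    m = base.shape[0] - toolh + 1
    n = base.shape[1] - toolw + 1
    zdiff = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            # Smallest tool-minus-base difference for this tool position.
            best = np.inf
            for a in range(toolh):
                for b in range(toolw):
                    d = tool[a, b] - base[i + a, j + b]
                    if d < best:
                        best = d
            zdiff[i, j] = best
    return zdiff


# Small arrays keep the pure-Python fallback quick.
rng = np.random.default_rng(1)
base = rng.random((20, 15))
tool = rng.random((3, 4))
zdiff = height_diff_loops(base, tool)
```

Whether this actually beats the numpy version for your array sizes is something you would have to time yourself; keep in mind that the first call includes the compilation time.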