Unexpectedly slow cython convolution code

Question

For one thing, the typing of M, eg and convolution doesn't allow fast indexing. The typing you've done is not particularly helpful at all, actually.

But it doesn't matter, because you have two overheads. The first is converting between Cython and Python types. You should keep untyped arrays around if you want to pass them to Python a lot, to prevent the need to convert. Just moving this to Python sped it up for that reason (1ms → 0.65μs).

Then I profiled it:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           def calcM(filtertaps, n, m, keep_indices, V):
    16      4111         3615      0.9      0.1      k = keep_indices.shape[1]
    17      4111         8024      2.0      0.1      M = np.zeros((n, m), dtype=np.float32)
    18      4111         6090      1.5      0.1      ej = np.zeros((m,), dtype=np.float32)
    19                                           
    20     28777        18690      0.6      0.3      for rowidx in range(n):
    21    197328       123284      0.6      2.2          for colidx in range(m):
    22    172662       112348      0.7      2.0              ej[colidx] = 1
    23    172662      4076225     23.6     73.6              convolution = convolve(ej, filtertaps, mode='full')
    24    517986       395513      0.8      7.1              for kidx in range(k):
    25    345324       668309      1.9     12.1                  M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
    26    172662       120271      0.7      2.2              ej[colidx] = 0
    27                                           
    28      4111         2374      0.6      0.0      return M

Before you consider anything else, deal with convolve.

Why is convolve slow? Well, it's got a lot of overhead. numpy/scipy normally does; it's best for large datasets. If you know the size of your array is going to stay small, just reimplement convolve in Cython.

Oh, try to use the buffer syntax. Use DTYPE[:, :] for a 2D array, DTYPE[:] for a 1D array, etc. It's the memoryview protocol, and it's way better. There are cases where it has more overhead, but those are typically possible to work around and it's way better in most other ways.

EDIT:

You can try (recursively) inlining the scipy function:

import numpy as np
from scipy.signal.sigtools import _correlateND

def calcM(filtertaps, n, m, keep_indices, V):
    k = keep_indices.shape[1]
    M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)

    for rowidx in range(n):
        for colidx in range(m):
            in1zpadded[colidx] = 1

            convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)

            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]

            in1zpadded[colidx] = 0

    return M

Note that this uses private implementation details.

This is tailored for the particular dimensions, so I don't know if it'll work on your actual data. But it removes the vast majority of overhead. You can then improve this by typing things again:

import numpy as np
cimport numpy as np
from scipy.signal.sigtools import _correlateND

DTYPE=np.float32
ctypedef np.float32_t DTYPE_t

def calcM(filtertaps, int n, int m, np.int_t[:, :] t_keep_indices, DTYPE_t[:, :] t_V):
    cdef int rowidx, colidx, kidx, k
    cdef DTYPE_t[:, :] t_M
    cdef DTYPE_t[:] t_in1zpadded, t_convolution

    k = t_keep_indices.shape[1]
    t_M = M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    t_in1zpadded = in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)

    for rowidx in range(n):
        for colidx in range(m):
            t_in1zpadded[colidx] = 1

            t_convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)

            for kidx in range(k):
                t_M[rowidx, colidx] += t_convolution[<int>t_keep_indices[rowidx, kidx]] * t_V[rowidx, kidx]

            t_in1zpadded[colidx] = 0

    return M

It's over 10x as fast, but not as high as your pie-in-the-sky estimate. Then again, that calculation was a bit bogus to begin with ;).