Domanda

I need to quickly compute a matrix whose entries are obtained by convolving a filter with a vector for each row, subsampling the entries of the resulting vector, and then taking the dot product of the result with another vector. Specifically, I want to compute

M = [conv(e_j, f)*P_i*v_i ]_{i,j},

where i varies from 1 to n and j varies from 1 to m. Here e_j is the indicator (row) vector of size n with a one only in column j, f is the filter of length s, P_i is an (n+s-1)-by-k matrix which samples the appropriate k entries from the convolution, and v_i is a column vector of length k.

It takes O(n*s) operations to compute each entry of M, so O(n*s*n*m) overall to compute M. For n=6, m=7, s=3, one core of my computer (8GLOPs) should be able compute M in roughly .094 microseconds. Yet my very simple cython implementation, following the example given in the Cython documentation, takes more than 2 milliseconds to compute an example with those parameters. That is about 4 orders of magnitude difference!

Here is a shar file with the Cython implementation and test code. Copy and paste it to a file and run 'bash <fname>' in a clean directory to get the code, then run 'bash ./test.sh' to see the abysmal performance.

cat > fastcalcM.pyx <<'EOF'

import numpy as np
cimport numpy as np
cimport cython
from scipy.signal import convolve

DTYPE=np.float32
ctypedef np.float32_t DTYPE_t

@cython.boundscheck(False)
def calcM(np.ndarray[DTYPE_t, ndim=1, negative_indices=False] filtertaps, int
        n, int m, np.ndarray[np.int_t, ndim=2, negative_indices=False]
        keep_indices, np.ndarray[DTYPE_t, ndim=2, negative_indices=False] V):
    """ Computes a numrows-by-k matrix M whose entries satisfy
        M_{i,k} = [conv(e_j, f)^T * P_i * v_i],
        where v_i^T is the i-th row of V, and P_i samples the entries from
        conv(e_j, f)^T indicated by the ith row of the keep_indices matrix """

    cdef int k = keep_indices.shape[1]

    cdef np.ndarray M = np.zeros((n, m), dtype=DTYPE)
    cdef np.ndarray ej = np.zeros((m,), dtype=DTYPE)
    cdef np.ndarray convolution
    cdef int rowidx, colidx, kidx

    for rowidx in range(n):
        for colidx in range(m):
            ej[colidx] = 1
            convolution = convolve(ej, filtertaps, mode='full')
            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
            ej[colidx] = 0

    return M

EOF
#-----------------------------------------------------------------------------
cat > test_calcM.py << 'EOF'

import numpy as np
from fastcalcM import calcM

filtertaps = np.array([-1, 2, -1]).astype(np.float32)
n, m = 6, 7
keep_indices = np.array([[1, 3], 
                         [4, 5],
                         [2, 2], 
                         [5, 5], 
                         [3, 4], 
                         [4, 5]]).astype(np.int)
V = np.random.random_integers(-5, 5, size=(6, 2)).astype(np.float32)

print calcM(filtertaps, n, m, keep_indices, V)

EOF
#-----------------------------------------------------------------------------
cat > test.sh << 'EOF'

python setup.py build_ext --inplace
echo -e "%run test_calcM\n%timeit calcM(filtertaps, n, m, keep_indices, V)" > script.ipy
ipython script.ipy

EOF
#-----------------------------------------------------------------------------
cat > setup.py << 'EOF'

from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    name="Fast convolutions",
    include_dirs = [numpy.get_include()],
    ext_modules = cythonize("fastcalcM.pyx")
)

EOF

I thought maybe the call to scipy's convolve might be the culprit (I'm not certain that cython and scipy play well together), so I implemented my own convolution code ala the same example in Cython documentation, but this resulted in the overall code being about 10 times slower.

Any ideas on how to get closer to the theoretically possible speed, or reasons why the difference is so great?

È stato utile?

Soluzione

For one thing, the typing of M, eg and convolution doesn't allow fast indexing. The typing you've done is not particularly helpful at all, actually.

But it doesn't matter, because you have two overheads. The first is converting between Cython and Python types. You should keep untyped arrays around if you want to pass them to Python a lot, to prevent the need to convert. Just moving this to Python sped it up for that reason (1ms → 0.65μs).

Then I profiled it:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           def calcM(filtertaps, n, m, keep_indices, V):
    16      4111         3615      0.9      0.1      k = keep_indices.shape[1]
    17      4111         8024      2.0      0.1      M = np.zeros((n, m), dtype=np.float32)
    18      4111         6090      1.5      0.1      ej = np.zeros((m,), dtype=np.float32)
    19                                           
    20     28777        18690      0.6      0.3      for rowidx in range(n):
    21    197328       123284      0.6      2.2          for colidx in range(m):
    22    172662       112348      0.7      2.0              ej[colidx] = 1
    23    172662      4076225     23.6     73.6              convolution = convolve(ej, filtertaps, mode='full')
    24    517986       395513      0.8      7.1              for kidx in range(k):
    25    345324       668309      1.9     12.1                  M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
    26    172662       120271      0.7      2.2              ej[colidx] = 0
    27                                           
    28      4111         2374      0.6      0.0      return M

Before you consider anything else, deal with convolve.

Why is convolve slow? Well, it's got a lot of overhead. numpy/scipy normally does; it's best for large datasets. If you know the size of your array is going to stay small, just reimplement convolve in Cython.

Oh, try to use the buffer syntax. Use DTYPE[:, :] for a 2D array, DTYPE[:] for a 1D array, etc. It's the memoryview protocol, and it's way better. There are cases where it has more overhead, but those are typically possible to work around and it's way better in most other ways.


EDIT:

You can try (recursively) inlining the scipy function:

import numpy as np
from scipy.signal.sigtools import _correlateND

def calcM(filtertaps, n, m, keep_indices, V):
    k = keep_indices.shape[1]
    M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)

    for rowidx in range(n):
        for colidx in range(m):
            in1zpadded[colidx] = 1

            convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)

            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]

            in1zpadded[colidx] = 0

    return M

Note that this uses private implementation details.

This is tailored for the particular dimensions, so I don't know if it'll work on your actual data. But it removes the vast majority of overhead. You can then improve this by typing things again:

import numpy as np
cimport numpy as np
from scipy.signal.sigtools import _correlateND

DTYPE=np.float32
ctypedef np.float32_t DTYPE_t

def calcM(filtertaps, int n, int m, np.int_t[:, :] t_keep_indices, DTYPE_t[:, :] t_V):
    cdef int rowidx, colidx, kidx, k
    cdef DTYPE_t[:, :] t_M
    cdef DTYPE_t[:] t_in1zpadded, t_convolution

    k = t_keep_indices.shape[1]
    t_M = M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    t_in1zpadded = in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)

    for rowidx in range(n):
        for colidx in range(m):
            t_in1zpadded[colidx] = 1

            t_convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)

            for kidx in range(k):
                t_M[rowidx, colidx] += t_convolution[<int>t_keep_indices[rowidx, kidx]] * t_V[rowidx, kidx]

            t_in1zpadded[colidx] = 0

    return M

It's over 10x as fast, but not as high as your pie-in-the-sky estimate. Then again, that calculation was a bit bogus to begin with ;).

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top