cython memoryview slower than expected
06-07-2021
Question
I've started using memoryviews in Cython to access NumPy arrays. One of the advantages they are supposed to have is that they are considerably faster than the old NumPy buffer support: http://docs.cython.org/src/userguide/memoryviews.html#comparison-to-the-old-buffer-support
However, I have an example where the old NumPy buffer support is faster than memoryviews. How can this be? Am I using memoryviews correctly?
This is my test:
import numpy as np
cimport numpy as np
cimport cython

# Version 1: old-style typed NumPy buffers
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[np.uint8_t, ndim=2] image_box1(np.ndarray[np.uint8_t, ndim=2] im,
                                                np.ndarray[np.float64_t, ndim=1] pd,
                                                int box_half_size):
    # Round the centre coordinates to the nearest pixel
    cdef unsigned int p0 = <int>(pd[0] + 0.5)
    cdef unsigned int p1 = <int>(pd[1] + 0.5)
    cdef unsigned int top = p1 - box_half_size
    cdef unsigned int left = p0 - box_half_size
    cdef unsigned int bottom = p1 + box_half_size
    cdef unsigned int right = p0 + box_half_size
    cdef np.ndarray[np.uint8_t, ndim=2] box = im[top:bottom, left:right]
    return box

# Version 2: typed memoryviews
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.uint8_t[:, ::1] image_box2(np.uint8_t[:, ::1] im,
                                    np.float64_t[:] pd,
                                    int box_half_size):
    # Round the centre coordinates to the nearest pixel
    cdef unsigned int p0 = <int>(pd[0] + 0.5)
    cdef unsigned int p1 = <int>(pd[1] + 0.5)
    cdef unsigned int top = p1 - box_half_size
    cdef unsigned int left = p0 - box_half_size
    cdef unsigned int bottom = p1 + box_half_size
    cdef unsigned int right = p0 + box_half_size
    cdef np.uint8_t[:, ::1] box = im[top:bottom, left:right]
    return box
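For checking what either function should return, here is a minimal pure-NumPy sketch of the same box arithmetic. The name image_box_reference and the sample values for im, pd and box_half_size are made up for illustration; they are not part of the original test.

```python
import numpy as np

def image_box_reference(im, pd, box_half_size):
    """Pure-NumPy reference for the box extraction (hypothetical helper)."""
    # Adding 0.5 and truncating rounds to the nearest pixel
    # (this assumes non-negative coordinates).
    p0 = int(pd[0] + 0.5)  # column of the box centre
    p1 = int(pd[1] + 0.5)  # row of the box centre
    top, bottom = p1 - box_half_size, p1 + box_half_size
    left, right = p0 - box_half_size, p0 + box_half_size
    # Basic slicing returns a view on im, so no pixel data is copied.
    return im[top:bottom, left:right]

# Sample inputs (assumed values, chosen so the box stays inside the image)
im = (np.arange(40 * 30) % 256).astype(np.uint8).reshape(40, 30)
pd = np.array([10.4, 20.6])
box = image_box_reference(im, pd, 3)
print(box.shape)
```

With these sample values the centre rounds to column 10 and row 21, so the slice is im[18:24, 7:13].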
The timing results for the two Cython functions are:
image_box1: typed numpy: 100000 loops, best of 3: 11.2 us per loop
image_box2: memoryview: 100000 loops, best of 3: 18.1 us per loop
These measurements were taken in IPython with %timeit, for example: %timeit image_box1(im, pd, box_half_size)
Solution
Alright, I found the problem. As seberg pointed out, the memoryviews appeared slower because the measurement included the automatic conversion from NumPy array to memoryview on every call.
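The effect can be sketched in plain Python, with a big caveat: the code below uses Python's built-in memoryview as a stand-in, not Cython's typed memoryviews, so the absolute numbers are not comparable to the ones in this post. It only illustrates the pattern of acquiring a view on every call versus once before the timed call. The function names are hypothetical.

```python
import timeit
import numpy as np

pd = np.array([10.4, 20.6])

def first_from_array(arr):
    # Stand-in for passing an ndarray: a buffer view is acquired on every call.
    view = memoryview(arr)
    return view[0]

def first_from_view(view):
    # Stand-in for passing a ready-made view: no acquisition inside the call.
    return view[0]

pd_view = memoryview(pd)  # converted once, outside the timed call

# Best of 3 repeats, 10000 calls each (total seconds per repeat)
t_each = min(timeit.repeat(lambda: first_from_array(pd), repeat=3, number=10000))
t_once = min(timeit.repeat(lambda: first_from_view(pd_view), repeat=3, number=10000))
print('view acquired on every call:', t_each)
print('view acquired once:         ', t_once)
```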
I used the following function to measure the times from within the cython module:
def test(params):
    import timeit
    im = params[0]
    pd = params[1]
    box_half_size = params[2]

    # Typed NumPy buffers: pass the arrays directly
    t1 = timeit.Timer(lambda: image_box1(im, pd, box_half_size))
    print('image_box1: typed numpy:')
    print(min(t1.repeat(3, 10)))

    # Memoryviews: convert once here, so the conversion is not timed
    cdef np.uint8_t[:, ::1] im2 = im
    cdef np.float64_t[:] pd2 = pd
    t2 = timeit.Timer(lambda: image_box2(im2, pd2, box_half_size))
    print('image_box2: memoryview:')
    print(min(t2.repeat(3, 10)))
Result (total seconds for 10 calls, best of 3 repeats):
image_box1: typed numpy: 9.07607864065e-05
image_box2: memoryview: 5.81799904467e-05
So memoryviews are indeed faster!
Note that I converted im and pd to memoryviews before calling image_box2. If I skip this step and pass im and pd directly, the conversion happens inside every call and image_box2 is slower:
image_box1: typed numpy: 9.12262257771e-05
image_box2: memoryview: 0.000185245087778
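Since t.repeat(3, 10) reports the total time for 10 calls, the figures above are not directly per-call times. A small sketch of the conversion to microseconds per call, using the numbers printed above (the helper name per_call_us is made up for illustration):

```python
NUMBER = 10  # calls per repeat, as in t.repeat(3, 10)

def per_call_us(total_seconds, number=NUMBER):
    """Convert a timeit total (seconds for `number` calls) to microseconds per call."""
    return total_seconds / number * 1e6

# Totals copied from the two runs reported above
totals = {
    'image_box1, run 1': 9.07607864065e-05,
    'image_box2, views converted beforehand': 5.81799904467e-05,
    'image_box1, run 2': 9.12262257771e-05,
    'image_box2, arrays passed directly': 0.000185245087778,
}

for label, total in totals.items():
    print('%-40s %6.2f us per call' % (label, per_call_us(total)))
```

That works out to about 9.1 µs per call for image_box1, about 5.8 µs for image_box2 with pre-converted memoryviews, and about 18.5 µs for image_box2 when the arrays are passed directly, which is in line with the %timeit figures in the question.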