Cython memoryview slower than expected

https://stackoverflow.com/questions/12800121

06-07-2021

Question

I've started using memoryviews in Cython to access NumPy arrays. According to the documentation, one of their advantages is that they are considerably faster than the old NumPy buffer support: http://docs.cython.org/src/userguide/memoryviews.html#comparison-to-the-old-buffer-support

However, I have an example where the old NumPy buffer support is faster than memoryviews. How can this be? Am I using memoryviews correctly?

This is my test:

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[np.uint8_t, ndim=2] image_box1(np.ndarray[np.uint8_t, ndim=2] im, 
                                               np.ndarray[np.float64_t, ndim=1] pd,  
                                               int box_half_size):
    cdef unsigned int p0 = <int>(pd[0] + 0.5)  
    cdef unsigned int p1 = <int>(pd[1] + 0.5)    
    cdef unsigned int top = p1 - box_half_size
    cdef unsigned int left = p0 - box_half_size
    cdef unsigned int bottom = p1 + box_half_size
    cdef unsigned int right = p0 + box_half_size    
    cdef np.ndarray[np.uint8_t, ndim=2] box = im[top:bottom, left:right] 
    return box 

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.uint8_t[:, ::1] image_box2(np.uint8_t[:, ::1] im, 
                                    np.float64_t[:] pd,  
                                    int box_half_size):

    cdef unsigned int p0 = <int>(pd[0] + 0.5)  
    cdef unsigned int p1 = <int>(pd[1] + 0.5)    
    cdef unsigned int top = p1 - box_half_size
    cdef unsigned int left = p0 - box_half_size
    cdef unsigned int bottom = p1 + box_half_size
    cdef unsigned int right = p0 + box_half_size     
    cdef np.uint8_t[:, ::1] box = im[top:bottom, left:right]   
    return box
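Both functions compute the same crop window before slicing. As a rough aid for checking that arithmetic, here is a plain-Python sketch; the helper name box_bounds and the sample values are illustrative assumptions and not part of the question's code.

```python
def box_bounds(pd, box_half_size):
    """Return (top, bottom, left, right) as computed in image_box1/image_box2."""
    # <int>(x + 0.5) truncates toward zero, which rounds a non-negative x
    # to the nearest integer; int() in Python truncates the same way.
    p0 = int(pd[0] + 0.5)
    p1 = int(pd[1] + 0.5)
    top = p1 - box_half_size
    left = p0 - box_half_size
    bottom = p1 + box_half_size
    right = p0 + box_half_size
    return top, bottom, left, right


# Hypothetical point (x=10.4, y=20.6) with a half size of 5.
print(box_bounds((10.4, 20.6), 5))  # (16, 26, 5, 15)
```

The resulting box is 2 * box_half_size pixels on each side. The sketch uses ordinary Python integers; in the Cython versions the bounds are unsigned int, so a point closer than box_half_size to the top or left edge would wrap to a large value instead of going negative.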

The timing results are:

image_box1: typed numpy: 100000 loops, best of 3: 11.2 us per loop

image_box2: memoryview: 100000 loops, best of 3: 18.1 us per loop

These measurements were taken in IPython using %timeit image_box1(im, pd, box_half_size) and the equivalent call for image_box2.

Solution

Alright, I found the problem. As seberg pointed out, the memoryviews appeared slower because the measurement included the automatic conversion from NumPy array to memoryview.

I used the following function to measure the times from within the Cython module:
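The measurement pitfall can be reproduced in miniature with the standard library alone. The sketch below is only an analogy: to_view and crop are hypothetical stand-ins (a built-in memoryview over a bytearray instead of Cython's array-to-memoryview conversion), so the absolute numbers will not match the Cython case. Only the pattern of timing with and without the conversion carries over.

```python
import timeit


def to_view(buf):
    # Stand-in for the conversion done at the function boundary.
    return memoryview(buf)


def crop(view, start, stop):
    # Hypothetical cheap "kernel", analogous to im[top:bottom, left:right].
    return view[start:stop]


buf = bytearray(range(256)) * 16  # 4096 bytes of dummy data
view = to_view(buf)               # converted once, outside the timed code

# Conversion paid on every timed call.
with_conversion = min(
    timeit.repeat(lambda: crop(to_view(buf), 100, 200), repeat=3, number=10000)
)

# Conversion hoisted out of the timed call.
without_conversion = min(
    timeit.repeat(lambda: crop(view, 100, 200), repeat=3, number=10000)
)

print("conversion inside timed call :", with_conversion)
print("conversion outside timed call:", without_conversion)
```

Both variants return the same data; the difference between the two timings is an estimate of what the conversion adds to each call.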

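def test(params):
    import timeit
    im = params[0]
    pd = params[1]
    box_half_size = params[2]

    # Typed NumPy buffer version: pass the arrays as they are.
    t1 = timeit.Timer(lambda: image_box1(im, pd, box_half_size))
    print('image_box1: typed numpy:')
    # repeat(3, 10): three repeats of 10 calls each; min() keeps the best total in seconds.
    print(min(t1.repeat(3, 10)))

    # Memoryview version: convert once, outside the timed call.
    cdef np.uint8_t[:, ::1] im2 = im
    cdef np.float64_t[:] pd2 = pd
    t2 = timeit.Timer(lambda: image_box2(im2, pd2, box_half_size))
    print('image_box2: memoryview:')
    print(min(t2.repeat(3, 10)))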

Result (best total time in seconds for 10 calls):

image_box1: typed numpy: 9.07607864065e-05

image_box2: memoryview: 5.81799904467e-05

So memoryviews are indeed faster!

Note that I converted im and pd to memoryviews before calling image_box2. If I skip this step and pass im and pd directly, every call pays for the conversion and image_box2 is slower:

image_box1: typed numpy: 9.12262257771e-05

image_box2: memoryview: 0.000185245087778
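Because t.repeat(3, 10) reports the total time for 10 calls, the figures above are easier to compare with the %timeit output once divided by 10. The following sketch does that arithmetic on the numbers quoted in this answer; the dictionary labels are my own.

```python
NUMBER = 10  # calls per repeat, from t.repeat(3, 10)

# Totals in seconds, copied from the two runs above.
totals = {
    "image_box1, memoryviews pre-converted": 9.07607864065e-05,
    "image_box2, memoryviews pre-converted": 5.81799904467e-05,
    "image_box1, arrays passed directly": 9.12262257771e-05,
    "image_box2, arrays passed directly": 0.000185245087778,
}

# Convert each total to microseconds per call.
per_call_us = {name: total / NUMBER * 1e6 for name, total in totals.items()}

for name, us in per_call_us.items():
    print(f"{name}: {us:.2f} us per call")
```

This gives about 9.1 us per call for image_box1 in both runs, and about 5.8 us versus 18.5 us for image_box2. The gap of roughly 12.7 us per call is a rough estimate of what converting both arguments costs, and the 18.5 us figure is close to the 18.1 us that %timeit reported in the question.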

Licensed under: CC-BY-SA with attribution
