Numpy sum of operator results without allocating an unnecessary array

Question 1

On my machine this is faster:

(a == b).sum()
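
As a quick sanity check, here is a minimal sketch of what that expression computes. The array shapes and the seed are made up for illustration, and `np.count_nonzero` is shown only as an alternative that may be quicker than `.sum()` on boolean arrays. Both forms still allocate the temporary boolean array.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen arbitrarily
a = rng.random((10, 10)) < 0.5
b = rng.random((10, 10)) < 0.5

# The comparison builds a temporary boolean array; summing it counts the matches.
n_sum = int((a == b).sum())

# Alternative reduction over the same temporary.
n_nonzero = int(np.count_nonzero(a == b))

# Reference count done element by element in pure Python.
expected = sum(1 for x, y in zip(a.ravel(), b.ravel()) if x == y)
```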

If you don't want to use any extra storage, then I would suggest using numba. I'm not too familiar with it, but this seems to work well. I ran into some trouble getting Cython to take a boolean NumPy array.

import numpy as np
from numba import njit  # newer numba releases dropped autojit; njit compiles on first call

def pysumeq(a, b):
    tot = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot
# make numba version
nbsumeq = njit(pysumeq)
A = (np.random.rand(10,10)<.5)
B = (np.random.rand(10,10)<.5)
# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)

If you don't have numba, I would suggest using the answer by @user2357112.

Edit: Just got a Cython version working, here's the .pyx file. I'd go with this.

from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysumeq(ar[np.uint8_t,ndim=2,cast=True] a, ar[np.uint8_t,ndim=2,cast=True] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    for i in xrange(h):
        for j in xrange(w):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot

Question 2

To start with, you can skip the A*B step:

>>> a
array([ True, False,  True, False,  True], dtype=bool)
>>> b
array([False,  True,  True, False,  True], dtype=bool)
>>> np.sum(~(a^b))
3

If you do not mind destroying array a or b, I am not sure you will get faster than this:

>>> a^=b   #In place xor operator
>>> np.sum(~a)
3
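
A small variation on the same idea, offered as a sketch rather than a benchmarked improvement: since XOR marks the positions that differ, the number of equal positions is the array size minus the number of set bits, which avoids the extra temporary created by `~`. The name `scratch` is hypothetical, and the copy is there only to keep the example's inputs intact.

```python
import numpy as np

a = np.array([True, False, True, False, True])
b = np.array([False, True, True, False, True])

# XOR is True where the inputs differ, so equal = total - differing.
n_equal = a.size - np.count_nonzero(a ^ b)

# In-place variant: XOR into an existing array instead of allocating a result.
# A copy is used here only so that a survives; in real use you would overwrite a.
scratch = a.copy()
scratch ^= b
n_equal_inplace = scratch.size - np.count_nonzero(scratch)
```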

Question 3

If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:

out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out=out).sum()

This'll only work if the inputs are always the same dimensions, though. You may be able to make one big array and slice out a part that's the size you need for each call if the inputs have varying sizes, but I'm not sure how much that slows you down.
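
The "one big array, slice what you need" idea could look roughly like this. It is a minimal sketch for 1-D inputs only; `MAX_LEN`, `scratch` and `count_equal` are hypothetical names, and the upper bound on the input length is an assumption you would have to pick for your own data.

```python
import numpy as np

MAX_LEN = 1000                            # assumed upper bound on input length
scratch = np.empty(MAX_LEN, dtype=bool)   # allocated once, reused on every call

def count_equal(a, b, buf=scratch):
    """Count equal elements of two 1-D arrays, writing the comparison into buf."""
    n = a.shape[0]
    out = buf[:n]               # slicing returns a view, so no new data buffer
    np.equal(a, b, out=out)     # comparison result goes into the view
    return int(np.count_nonzero(out))

x = np.array([1, 2, 3, 4])
y = np.array([1, 0, 3, 0])
n_xy = count_equal(x, y)
```

Inputs longer than the buffer make `np.equal` raise a shape error rather than silently truncating, so the bound has to be chosen generously.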

Question 4

Improving upon IanH's answer, it's also possible to get access to the underlying C array in a numpy array from within Cython, by supplying mode="c" to ndarray.

from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t,ndim=2,cast=True,mode="c"] a, ar[np.uint8_t,ndim=2,cast=True,mode="c"] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    cdef np.uint8_t* adata = &a[0, 0]
    cdef np.uint8_t* bdata = &b[0, 0]
    for i in xrange(h):
        for j in xrange(w):
            if adata[j] == bdata[j]:
                tot += 1
        adata += w
        bdata += w
    return tot

This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't seem to make much of a difference at this point, probably due to compiler optimizations. From here, one could potentially link to a C function optimized with SSE and the like to perform this operation, passing adata and bdata as uint8_t* pointers.