Vectorized way of calculating row-wise dot product two matrices with Scipy

Question 1

Check out numpy.einsum for another method:

In [52]: a
Out[52]: 
array([[1, 2, 3],
       [3, 4, 5]])

In [53]: b
Out[53]: 
array([[1, 2, 3],
       [1, 2, 3]])

In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])

Looks like einsum is a bit faster than inner1d:

In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop

In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop

In [96]: a = random.randn(10, 100)

In [97]: b = random.randn(10, 100)

In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop

In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop

Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.

Question 2

Straightforward way to do that is:

import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)

which avoids the python loop and is faster in cases like:

def npsumdot(x, y):
    return np.sum(x*y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result

timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop

Question 3

Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use

numpy.einsum("ij,ij->i", a, b)

Even better is to align your memory such that the summation happens in the first dimension, e.g.,

a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)

numpy.einsum("ij,ij->j", a, b)

For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.

Note also that the transposed summation is much faster than its untransposed equivalent.

The plot was created with perfplot (a small project of mine)

import numpy
from numpy.core.umath_tests import inner1d
import perfplot


def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)


b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)

b.save("out1.png")
b.save("out2.png", relative_to=3)

Question 4

You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.

import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
result=np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i],b[i]))
print result

EDIT

Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.

from numpy.core.umath_tests import inner1d
inner1d(a,b)  # array([14, 26])

Question 5

I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.

For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.

My large matrix example:

import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000)  # 800 MB each
b = np.random.randn(100, 1000000)  # pretty big.

def loop_dot(a, b):
    result = np.empty((a.shape[1],))
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b)  # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!