The problem is you are using vectorize
on a function that takes non-scalar arguments. The idea with NumbaPro's vectorize
is that it takes a scalar function as input, and generates a function that applies the scalar operation in parallel to all the elements of a vector. See the NumbaPro documentation.
Your function takes a matrix and a vector, which are definitely not scalar. [Edit] You can do what you want on the GPU using either NumbaPro's wrapper for cuBLAS, or by writing your own simple kernel function. Here's an example that demonstrates both. Note will need NumbaPro 0.12.2 or later (just released as of this edit).
from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer
def generate_input(n):
A = np.array(np.random.sample((n,n)), dtype=np.float32)
B = np.array(np.random.sample(n), dtype=A.dtype)
return A, B
@cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
def diagproduct(c, a, b):
startX, startY = cuda.grid(2)
gridX = cuda.gridDim.x * cuda.blockDim.x;
gridY = cuda.gridDim.y * cuda.blockDim.y;
height, width = c.shape
for y in range(startY, height, gridY):
for x in range(startX, width, gridX):
c[y, x] = a[y, x] * b[x]
def main():
N = 1000
A, B = generate_input(N)
D = np.empty(A.shape, dtype=A.dtype)
E = np.zeros(A.shape, dtype=A.dtype)
F = np.empty(A.shape, dtype=A.dtype)
start = timer()
E = np.dot(A, np.diag(B))
numpy_time = timer() - start
blas = cublas.api.Blas()
start = timer()
blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
cublas_time = timer() - start
diff = np.abs(D-E)
print("Maximum CUBLAS error %f" % np.max(diff))
blockdim = (32, 8)
griddim = (16, 16)
start = timer()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dF = cuda.to_device(F, copy=False)
diagproduct[griddim, blockdim](dF, dA, dB)
dF.to_host()
cuda_time = timer() - start
diff = np.abs(F-E)
print("Maximum CUDA error %f" % np.max(diff))
print("Numpy took %f seconds" % numpy_time)
print("CUBLAS took %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time))
print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))
if __name__ == '__main__':
main()
The kernel is significantly faster because SGEMM does a full matrix-matrix multiply (O(n^3)), and expands the diagonal into a full matrix. The diagproduct
function is smarter. It simply does a single multiply for each matrix element, and never expands the diagonal to a full matrix. Here are the results on my NVIDIA Tesla K20c GPU for N=1000:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.024535 seconds
CUBLAS took 0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup
The timing includes all of the copies to and from the GPU, which is a significant bottleneck for small matrices. If we set N to 10,000 and run again, we get a much bigger speedup:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 7.245677 seconds
CUBLAS took 1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup
For very small matrices, however, CUBLAS SGEMM has an optimized path so it is closer to the CUDA performance. Here, N=100
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.006876 seconds
CUBLAS took 0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup