Question

I am trying to learn CUDA and am using PyCUDA to write a simple matrix multiplication program. For two 4x4 randomly generated matrices I get the following results:

Cuda:
[[ -5170.86181641 -21146.49609375  20690.02929688 -35413.9296875 ]
 [-18998.5         -3199.53271484  13364.62890625   7141.36816406]
 [ 31197.43164062  21095.02734375   1750.64453125  11304.63574219]
 [  -896.64978027  18424.33007812 -17135.00390625   7418.28417969]]

Python:
[[ -5170.86035156 -21146.49609375  20690.02929688 -35413.9296875 ]
 [-18998.5         -3199.53271484  13364.62695312   7141.36816406]
 [ 31197.43164062  21095.02929688   1750.64404297  11304.63574219]
 [  -896.64941406  18424.33007812 -17135.00390625   7418.28417969]]

Cuda-Python:
[[-0.00146484  0.          0.          0.        ]
 [ 0.          0.          0.00195312  0.        ]
 [ 0.         -0.00195312  0.00048828  0.        ]
 [-0.00036621  0.          0.          0.        ]]

The error is of the order of 1e-3 and increases as I increase the size of the matrices. I am not sure whether it's a bug or not. My question is whether such a "large" error is normal with float32 arithmetic, or whether I am doing something wrong?

Here is the source code:

matmul.py

import numpy as np
import time
# import pycuda stuff
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

BLOCK_SIZE = 16

n = 4
ni = np.int32(n)

# matrix A 
a = np.random.randn(n, n)*100
a = a.astype(np.float32)

# matrix B
b = np.random.randn(n, n)*100
b = b.astype(np.float32)

# matrix C (result)
c = np.empty([n, n], dtype=np.float32)

# allocate memory on device
a_gpu = cuda.mem_alloc(a.nbytes)
b_gpu = cuda.mem_alloc(b.nbytes)
c_gpu = cuda.mem_alloc(c.nbytes)

# copy matrix to memory
cuda.memcpy_htod(a_gpu, a)
cuda.memcpy_htod(b_gpu, b)

# compile kernel
mod = SourceModule(open("kernels.cu", "r").read())

# get function
matmul = mod.get_function("matmul")


# set grid size
if n % BLOCK_SIZE != 0:
    grid = (n // BLOCK_SIZE + 1, n // BLOCK_SIZE + 1, 1)
else:
    grid = (n // BLOCK_SIZE, n // BLOCK_SIZE, 1)

# call gpu function
start = time.time()
matmul(ni, a_gpu, b_gpu, c_gpu, block=(BLOCK_SIZE, BLOCK_SIZE, 1), grid=grid)
end = time.time()
print "Time: %.5f s"%(end-start)

# copy back the result
cuda.memcpy_dtoh(c, c_gpu)

print(np.linalg.norm(c - np.dot(a, b)))
print(c)
print(np.dot(a, b))
print(c - np.dot(a, b))

kernels.cu

__global__ void matmul(int n, const float *A, const float *B, float *C){

  int tx = threadIdx.x;
  int ty = threadIdx.y;

  int bx = blockIdx.x;
  int by = blockIdx.y;

  int row = by*blockDim.y + ty;
  int col = bx*blockDim.x + tx;

  if(row < n && col < n){
    float val = 0.0f;
    for(int i=0; i<n; ++i){
      val += A[row*n + i]*B[n*i + col];
    }
    C[row*n + col] = val;
  }
}

Solution

Just adding to what Warren has said. I don't think there's anything wrong here. You're comparing the floating point results generated by two different machines (CPU and GPU). They are not guaranteed to be bitwise identical for operations at this level, partly because the order of operations on the GPU is not necessarily the same as the order of operations on the CPU. As you increase the size of the matrices, you increase the number of values summed together, and the absolute error grows, because you are adding up a larger number of small rounding errors.
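
As a quick illustration (a minimal NumPy sketch, not part of the code above; the array and variable names are made up), summing the same float32 values in two different orders already produces results that differ in the last bits:

import numpy as np

np.random.seed(0)
x = (np.random.randn(1024) * 100).astype(np.float32)

forward = np.float32(0.0)
for v in x:
    forward += v          # accumulate in natural order

backward = np.float32(0.0)
for v in x[::-1]:
    backward += v         # same values, reversed order

print(forward, backward, forward - backward)  # difference is typically nonzero

The more values you fold into each sum (i.e. the larger n is), the more of these tiny rounding differences accumulate, which is exactly the growth you are seeing.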

In general, these kinds of considerations should always come into play when comparing floating point results. Bitwise identical results are rarely to be expected from two different computations, and for floating point arithmetic even something as simple as changing the order of operations makes it a different calculation. You may want to read this paper, especially section 2.2.
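
If you want to judge the result numerically, compare against a higher-precision reference using a relative tolerance rather than looking at raw differences. Here is a sketch under the same setup as the question (random float32 matrices scaled by 100; the variable names are illustrative, and np.dot in float32 stands in for the GPU result):

import numpy as np

n = 4
a = (np.random.randn(n, n) * 100).astype(np.float32)
b = (np.random.randn(n, n) * 100).astype(np.float32)

c32 = np.dot(a, b)                                          # float32 product (stand-in for the GPU result)
c64 = np.dot(a.astype(np.float64), b.astype(np.float64))    # higher-precision reference

# The entries are of order 1e4, so an absolute difference of ~1e-3 is a
# relative error of roughly 1e-7, i.e. at the float32 precision limit.
rel_err = np.linalg.norm(c32 - c64) / np.linalg.norm(c64)
print("relative error:", rel_err)
print("within float32 tolerance:", np.allclose(c32, c64, rtol=1e-5, atol=1e-2))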
