Question

I've been working on a Fortran code which uses the cuBLAS batched LU factorization and the cuSPARSE batched tridiagonal solver as part of a BiCG iterative solver with an ADI preconditioner. I'm using a Kepler K20X with compute capability 3.5 and CUDA 5.5. I'm doing this without PGI's CUDA Fortran, so I'm writing my own interfaces:

FUNCTION cublasDgetrfBatched(handle, n, dA, ldda, dP, dInfo, nbatch) BIND(C, NAME="cublasDgetrfBatched")
    USE, INTRINSIC :: ISO_C_BINDING
    INTEGER(KIND(CUBLAS_STATUS_SUCCESS)) :: cublasDgetrfBatched
    TYPE(C_PTR), VALUE :: handle
    INTEGER(C_INT), VALUE :: n
    TYPE(C_PTR), VALUE :: dA
    INTEGER(C_INT), VALUE :: ldda
    TYPE(C_PTR), VALUE :: dP
    TYPE(C_PTR), VALUE :: dInfo
    INTEGER(C_INT), VALUE :: nbatch
END FUNCTION cublasDgetrfBatched
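
The snippets below also call a handful of CUDA runtime and cuBLAS helper routines (cudaHostAlloc, cudaMalloc, cublasSetVector, cublasSetMatrixAsync, ...) whose hand-written interfaces aren't shown. A minimal sketch of the cudaHostAlloc and cudaMalloc ones, assuming cudaError_t is mapped to INTEGER(C_INT) and that all of these interface bodies live in an INTERFACE block inside a module, would look like:

INTERFACE
    ! cudaError_t cudaHostAlloc(void **pHost, size_t size, unsigned int flags)
    FUNCTION cudaHostAlloc(pHost, nbytes, flags) BIND(C, NAME="cudaHostAlloc")
        USE, INTRINSIC :: ISO_C_BINDING
        INTEGER(C_INT) :: cudaHostAlloc
        TYPE(C_PTR) :: pHost                ! receives the pinned host pointer (void **)
        INTEGER(C_SIZE_T), VALUE :: nbytes
        INTEGER(C_INT), VALUE :: flags      ! e.g. cudaHostAllocDefault = 0
    END FUNCTION cudaHostAlloc

    ! cudaError_t cudaMalloc(void **devPtr, size_t size)
    FUNCTION cudaMalloc(devPtr, nbytes) BIND(C, NAME="cudaMalloc")
        USE, INTRINSIC :: ISO_C_BINDING
        INTEGER(C_INT) :: cudaMalloc
        TYPE(C_PTR) :: devPtr               ! receives the device pointer (void **)
        INTEGER(C_SIZE_T), VALUE :: nbytes
    END FUNCTION cudaMalloc
END INTERFACE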

I allocate pinned memory on the host with cudaHostAlloc, allocate device memory for the matrices and for the device array holding the device pointers to those matrices, asynchronously copy each matrix to the device, perform the operations, and then asynchronously copy the decomposed matrices and pivots back to the host to perform the back-substitution with a single right-hand side:

REAL(8), POINTER, DIMENSION(:,:,:) :: A
INTEGER, DIMENSION(:,:), POINTER :: ipiv
TYPE(C_PTR) :: cPtr_A, cPtr_ipiv
TYPE(C_PTR), ALLOCATABLE, DIMENSION(:), TARGET :: dPtr_A
TYPE(C_PTR) :: dPtr_ipiv, dPtr_A_d, dPtr_info
INTEGER(C_SIZE_T) :: sizeof_A, sizeof_ipiv

...

stat = cudaHostAlloc(cPtr_A, sizeof_A, cudaHostAllocDefault)
CALL C_F_POINTER(cPtr_A, A, (/m,m,nbatch/))
stat = cudaHostAlloc(cPtr_ipiv, sizeof_ipiv, cudaHostAllocDefault)
CALL C_F_POINTER(cPtr_ipiv, ipiv, (/m,nbatch/))

ALLOCATE(dPtr_A(nbatch))
DO ibatch=1,nbatch
  stat = cudaMalloc(dPtr_A(ibatch), m*m*sizeof_double)
END DO
stat = cudaMalloc(dPtr_A_d, nbatch*sizeof_cptr)
stat = cublasSetVector(nbatch, sizeof_cptr, C_LOC(dPtr_A(1)), 1, dPtr_A_d, 1)
stat = cudaMalloc(dPtr_ipiv, m*nbatch*sizeof_cint)
stat = cudaMalloc(dPtr_info, nbatch*sizeof_cint)

...

!$OMP PARALLEL DEFAULT(shared) PRIVATE( stat, ibatch )
!$OMP DO
DO ibatch = 1,nbatch
  stat = cublasSetMatrixAsync(m, m, sizeof_double, C_LOC(A(1,1,ibatch)), m, dPtr_A(ibatch), m, mystream)
END DO
!$OMP END DO
!$OMP END PARALLEL

...

stat = cublasDgetrfBatched(cublas_handle, m, dPtr_A_d, m, dPtr_ipiv, dPtr_info, nbatch)

...

stat = cublasGetMatrixAsync(m, nbatch, sizeof_cint, dPtr_ipiv, m, C_LOC(ipiv(1,1)), m, mystream)

!$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, stat )
!$OMP DO
DO ibatch = 1,nbatch
  stat = cublasGetMatrixAsync(m, m, sizeof_double, dPtr_A(ibatch), m, C_LOC(A(1,1,ibatch)), m, mystream)
END DO
!$OMP END DO
!$OMP END PARALLEL

...

!$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, x, stat, info )
!$OMP DO
DO ibatch = 1,nbatch
  x = rhs(:,ibatch)
  CALL dgetrs( 'N', m, 1, A(1,1,ibatch), m, ipiv(1,ibatch), x(1), m, info )
  rhs(:,ibatch) = x
END DO
!$OMP END DO
!$OMP END PARALLEL

...

I'd rather not have to do this last step on the host, but the cublasDtrsmBatched routine limits the matrix size to 32, and mine are of size 80 (a batched Dtrsv would be better, but one doesn't exist). The cost of launching many individual cublasDtrsv kernels makes performing the back-substitution on the device untenable.

There are other operations which I need to perform between the calls to cublasDgetrfBatched and cusparseDgtsvStridedBatch. Most of these are currently performed on the host, with OpenMP used to parallelize the loops at the batch level. Some operations, such as a matrix-vector multiplication for each of the matrices being decomposed, are computed on the device with OpenACC:

!$ACC DATA COPYIN(A) COPYIN(x) COPYOUT(Ax)

...

!$ACC KERNELS
  DO ibatch = 1,nbatch
    DO i = 1,m
      Ax(i,ibatch) = zero
    END DO
    DO j = 1,m
      DO i = 1,m
        Ax(i,ibatch) = Ax(i,ibatch) + A(i,j,ibatch)*x(j,ibatch)
      END DO
    END DO
  END DO
!$ACC END KERNELS

...

!$ACC END DATA

I'd like to move more of the computation onto the GPU with OpenACC, but to do so I need to be able to interface OpenACC with the cuBLAS batched routines. Something like the following:

!$ACC DATA COPYIN(A) CREATE(info,A_d) COPYOUT(ipiv)

!$ACC HOST_DATA USE_DEVICE(A)
DO ibatch = 1,nbatch
  A_d(ibatch) = acc_deviceptr(A(1,1,ibatch))
END DO
!$ACC END HOST_DATA

...

!$ACC HOST_DATA USE_DEVICE(ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA

...

!$ACC END DATA

I know the host_data construct with the use_device clause would be appropriate in most cases, but since I need to pass cuBLAS a device array containing the device pointers to the matrices, I'm not sure how to proceed.

Can anyone offer any insight?

Thanks


Solution

The two variants below differ only in where the A_d array of device pointers is populated: the first builds it on the device inside a compute region, the second builds it on the host with acc_deviceptr and copies it over. In both cases HOST_DATA USE_DEVICE then hands the device addresses to cuBLAS.

!! Put everything on the device
!$ACC DATA COPYIN(A) CREATE(info,A_d) COPYOUT(ipiv)

!! populate the device A_d array
!$ACC PARALLEL LOOP
DO ibatch = 1,nbatch
  A_d(ibatch) = A(1,1,ibatch)
END DO
!$ACC END PARALLEL LOOP

...

!! send the device address of A_d to the device
!$ACC HOST_DATA USE_DEVICE(A_d,ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA

...

!$ACC END DATA


or

!! Put everything but A_d on the device
!$ACC DATA COPYIN(A) CREATE(info) COPYOUT(ipiv)

!! populate the host A_d array
DO ibatch = 1,nbatch
  A_d(ibatch) = acc_deviceptr( A(1,1,ibatch) )
END DO

!! copy A_d to the device
!$acc data copyin( A_d )

...

!! send the device address of A_d and others to the device
!$ACC HOST_DATA USE_DEVICE(A_d,ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, A_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA

...

!$acc end data

!$ACC END DATA
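
A note on the interface (my reading of the sketches above, not something they state): inside the HOST_DATA region the names A_d, ipiv and info keep their Fortran array types and simply resolve to device addresses, so the binding used for this call would need to take those arguments by reference rather than as TYPE(C_PTR) VALUE like the interface at the top of the question. A hedged sketch of such a variant (the Fortran name cublasDgetrfBatched_acc is made up; it binds to the same C symbol):

FUNCTION cublasDgetrfBatched_acc(handle, n, dA, ldda, dP, dInfo, nbatch) &
         BIND(C, NAME="cublasDgetrfBatched")
    USE, INTRINSIC :: ISO_C_BINDING
    INTEGER(KIND(CUBLAS_STATUS_SUCCESS)) :: cublasDgetrfBatched_acc
    TYPE(C_PTR), VALUE :: handle
    INTEGER(C_INT), VALUE :: n
    TYPE(C_PTR) :: dA(*)           ! device array of device pointers, passed by reference
    INTEGER(C_INT), VALUE :: ldda
    INTEGER(C_INT) :: dP(*)        ! pivot indices on the device
    INTEGER(C_INT) :: dInfo(*)     ! per-matrix info flags on the device
    INTEGER(C_INT), VALUE :: nbatch
END FUNCTION cublasDgetrfBatched_acc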
