The batched triangular backsolver is something I hadn't tried before in CUBLAS, so I was interested to take a look and see what might be going on. Your code is rather complex, so I didn't bother trying to understand it, but when I ran it, it appeared to be failing with an internal CUBLAS launch failure:
$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
!!!! kernel execution error.
========= Program hit error 8 on CUDA API call to cudaLaunch
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_256.00.35.dylib (cudbgGetAPIVersion + 0x27bd7) [0x4538e7]
========= Host Frame:/usr/local/cuda/lib/libcudart.dylib (cudaLaunch + 0x26c) [0x45c8c]
========= Host Frame:/usr/local/cuda/lib/libcublas.dylib (cublasZgetrfBatched + 0x1e34) [0x196ae4]
========= Host Frame:/usr/local/cuda/lib/libcublas.dylib (cublasCtrsmBatched + 0x64d) [0x1974cd]
========= Host Frame:/usr/local/cuda/lib/libcublas.dylib (cublasCtrsmBatched + 0xacb) [0x19794b]
========= Host Frame:/Users/talonmies/./a.out (_Z22TestCublasStrsmBatchedii + 0x3c1) [0x1b28]
========= Host Frame:/Users/talonmies/./a.out (main + 0x3d) [0x1b7d]
========= Host Frame:/Users/talonmies/./a.out (start + 0x35) [0x14e9]
========= Host Frame:[0x1]
(This is an OS X machine with a compute 1.2 GPU and CUDA 5.0). Error 8 is cudaErrorInvalidDeviceFunction, which usually only comes up when a library or fatbinary doesn't contain an architecture which matches your GPU and can't be JIT recompiled into something your GPU can run.
Intrigued, I wrote my own much simpler repro case from scratch:
#include <iostream>
#include <cublas_v2.h>

int main(void)
{
    // One shared 5x5 triangular matrix, solved against Nsys copies of a
    // two column right-hand side.
    const int Neq = 5, Nrhs = 2, Nsys = 4;

    float Atri[Neq][Neq] =
        { { 1,  6, 11, 16, 21 },
          { 0,  7, 12, 17, 22 },
          { 0,  0, 13, 18, 23 },
          { 0,  0,  0, 19, 24 },
          { 0,  0,  0,  0, 25 } };

    float B[Nrhs][Neq] =
        { { 1, 27, 112, 290, 595 },
          { 2, 40, 148, 360, 710 } };

    // Device storage plus the host-side arrays of device pointers the
    // batched interface requires.
    float *syslhs[Nsys], *sysrhs[Nsys];
    float *A_, *B_, **syslhs_, **sysrhs_;

    size_t Asz = sizeof(float) * (size_t)(Neq * Neq);
    size_t Bsz = sizeof(float) * (size_t)(Neq * Nrhs);

    cudaMalloc((void **)(&A_), Asz);
    cudaMalloc((void **)(&B_), Bsz * size_t(Nsys));
    cudaMemcpy(A_, Atri, Asz, cudaMemcpyHostToDevice);

    for(int i=0; i<Nsys; i++) {
        // Every system in the batch shares the same triangular matrix.
        syslhs[i] = A_;
        sysrhs[i] = (float *)((char *)B_ + i * Bsz);
        cudaMemcpy(sysrhs[i], B, Bsz, cudaMemcpyHostToDevice);
    }

    size_t syssz = sizeof(float *) * (size_t)Nsys;
    cudaMalloc((void **)&syslhs_, syssz);
    cudaMalloc((void **)&sysrhs_, syssz);
    cudaMemcpy(syslhs_, syslhs, syssz, cudaMemcpyHostToDevice);
    cudaMemcpy(sysrhs_, sysrhs, syssz, cudaMemcpyHostToDevice);

    const cublasSideMode_t side = CUBLAS_SIDE_LEFT;
    const cublasDiagType_t diag = CUBLAS_DIAG_NON_UNIT;
    const cublasFillMode_t ulo = CUBLAS_FILL_MODE_LOWER;
    const cublasOperation_t trans = CUBLAS_OP_N;
    float alpha = 1.f;

    cublasHandle_t handle;
    cublasCreate(&handle);

    cublasStrsmBatched(handle,
                       side, ulo, trans, diag,
                       Neq, Nrhs,
                       &alpha,
                       syslhs_, Neq,
                       sysrhs_, Neq,
                       Nsys);

    // Fetch each solved system back and print it.
    for(int k=0; k<Nsys; k++) {
        cudaMemcpy(B, sysrhs[k], Bsz, cudaMemcpyDeviceToHost);
        for(int i=0; i<Nrhs; i++) {
            for(int j=0; j<Neq; j++) {
                std::cout << B[i][j] << ",";
            }
            std::cout << std::endl;
        }
        std::cout << std::endl;
    }

    cublasDestroy(handle);
    cudaFree(A_); cudaFree(B_);
    cudaFree(syslhs_); cudaFree(sysrhs_);

    return 0;
}
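For reference, this is roughly how I build and run the repro (the file name repro.cu is my assumption; on a CUDA 5 era toolkit you just link against libcublas):

```shell
# Build and run the repro case (file name repro.cu is an assumption).
# Skip gracefully when no CUDA toolchain is installed.
if command -v nvcc >/dev/null 2>&1; then
    nvcc -o repro repro.cu -lcublas
    ./repro
else
    echo "nvcc not found, skipping"
fi
```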
This also fails in the same way as your code. On first inspection this really does look like an internal CUBLAS problem, although it is very difficult to say what. About the only thing I can think of is that these solvers are not supported on every compute capability, but the documentation fails to mention any such restriction. Between us we have tested compute 1.2, compute 1.3, and compute 3.0 [error on my part, I read K10 rather than T10 in your question] devices, so there isn't much else left...
All I can suggest is running your code under cuda-memcheck and seeing whether it reports the same error. If it does, I see a bug report to NVIDIA in your future.
EDIT: I flagrantly disregarded the EULA and used cuobjdump to explore the cubin payloads in the CUDA 5 cublas library. For the single precision batched trsm routines I found cubins for
- 32 bit sm_20
- 32 bit sm_30
- 32 bit sm_35
- 64 bit sm_20
- 64 bit sm_30
- 64 bit sm_35
There are clearly no sm_1x cubins in the library, so my compute 1.2 device should produce exactly the runtime library error I see. It also explains your error with the GTX 285 and Tesla T10, which are both compute 1.3.
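For anyone who wants to repeat the check on their own install, the listing came from something like this (the library path is the OS X CUDA 5.0 layout; adjust it for your platform):

```shell
# List the cubin architectures embedded in the CUBLAS library.
# Path assumes the OS X CUDA 5.0 layout; skip if cuobjdump is absent.
LIB=/usr/local/cuda/lib/libcublas.dylib
if command -v cuobjdump >/dev/null 2>&1 && [ -f "$LIB" ]; then
    cuobjdump --list-elf "$LIB" | grep -o 'sm_[0-9]*' | sort -u
else
    echo "cuobjdump or libcublas not found, skipping"
fi
```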
EDIT2:
As suspected, my repro code runs perfectly on a linux system with a compute_30 device under both CUDA 5.0 and CUDA 5.5 release libraries.