No, there's no such function in the CUDA libraries, but you might look at this code to help you design a solution in CUDA:
https://github.com/poliu2s/MKL/blob/master/matrix_exponential.cpp
If you are working on a device of compute capability 3.5 or newer, it could be easier to solve your problem with dynamic parallelism: you can launch a __global__ kernel from another __global__ kernel without returning to the host, so you can choose the launch configuration (blocks and threads) for the child kernel yourself.
Basically:
__global__ void child( ... )
{
    ....
}
__global__ void parent( ... )
{
    child<<< ..., ... >>>( ... );
}
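Here is a minimal compilable sketch of the idea, assuming a compute capability 3.5+ device and compilation with relocatable device code (e.g. `nvcc -arch=sm_35 -rdc=true`); the kernel names and launch configurations are just illustrative:

```cuda
#include <cstdio>

// Child kernel: does some per-thread work; here it just prints.
__global__ void child(int parentBlock)
{
    printf("child of parent block %d, thread %d\n",
           parentBlock, threadIdx.x);
}

// Parent kernel: one thread per block launches a child grid
// with a configuration decided at runtime, on the device.
__global__ void parent()
{
    if (threadIdx.x == 0) {
        // Device-side launch: no round trip to the host.
        child<<<1, 4>>>(blockIdx.x);
        // Child grids are guaranteed to finish before the
        // parent grid completes, so no explicit sync is
        // required here for correctness.
    }
}

int main()
{
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();  // host waits for all grids
    return 0;
}
```

Note that the parent kernel can compute the child's grid and block dimensions from data it has just produced, which is the main benefit over returning to the host between launches.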
Hope this helps.