Could a CUDA kernel call a cublas function?

Question 1

Yes, it can, but only in CUDA versions before 10: the device-side cuBLAS library is no longer available as of CUDA 10.

"The language interface and Device Runtime API available in CUDA C/C++ is a subset of the CUDA Runtime API available on the Host. The syntax and semantics of the CUDA Runtime API have been retained on the device in order to facilitate ease of code reuse for API routines that may run in either the host or device environments. A kernel can also call GPU libraries such as CUBLAS directly without needing to return to the CPU." Source

Here you can see a matrix-vector multiplication using CUDA and the cuBLAS library function cublasSgemv.

Bear in mind, however, that there is no longer a device cuBLAS capability in CUDA 10. To quote Robert_Crovella:

"The current recommendation would be to see if CUTLASS 2 will help (it is mostly focused on GEMM related activities). If not, write your own code to perform the function, or call cublas from host code."
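As an illustration of the "call cublas from host code" option, here is a minimal sketch of a matrix-vector product issued from the host with cublasSgemv. The helper name sgemvOnHost and its error-handling convention (an empty vector on failure) are assumptions made for this example and are not part of cuBLAS or of the cited post. The matrix is assumed to be stored in column-major order, which is what cuBLAS expects.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not from the original post): computes y = A * x on the
// GPU by calling cublasSgemv from host code. A is an m x n matrix stored in
// column-major order. Returns an empty vector if any CUDA/cuBLAS call fails.
std::vector<float> sgemvOnHost(const std::vector<float> &A,
                               const std::vector<float> &x, int m, int n)
{
    const float alpha = 1.0f, beta = 0.0f;  // host scalars (default pointer mode)
    std::vector<float> y(m);
    float *d_A = nullptr, *d_x = nullptr, *d_y = nullptr;
    cublasHandle_t handle = nullptr;

    // Each step runs only if all previous steps succeeded.
    bool ok =
        cudaMalloc((void **)&d_A, sizeof(float) * m * n) == cudaSuccess &&
        cudaMalloc((void **)&d_x, sizeof(float) * n) == cudaSuccess &&
        cudaMalloc((void **)&d_y, sizeof(float) * m) == cudaSuccess &&
        cudaMemcpy(d_A, A.data(), sizeof(float) * m * n,
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_x, x.data(), sizeof(float) * n,
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS &&
        // y = alpha * A * x + beta * y, leading dimension of A is m
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m,
                    d_x, 1, &beta, d_y, 1) == CUBLAS_STATUS_SUCCESS &&
        cudaMemcpy(y.data(), d_y, sizeof(float) * m,
                   cudaMemcpyDeviceToHost) == cudaSuccess;

    if (handle) cublasDestroy(handle);
    cudaFree(d_A);  // cudaFree(nullptr) is a no-op
    cudaFree(d_x);
    cudaFree(d_y);
    if (!ok) y.clear();
    return y;
}
```

A file containing this function would typically be built with something like nvcc file.cu -lcublas. For the 2 x 3 matrix with rows (1, 2, 3) and (4, 5, 6), the column-major input is {1, 4, 2, 5, 3, 6}.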

Nonetheless, there are currently several implementations of matrix-vector multiplication available online, for instance 1, 2, among others.
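For the "write your own code" route, a hand-written kernel can be quite short. The sketch below is not taken from any of the linked implementations; the names matVecKernel and matVecCustom are hypothetical, and it uses the simplest possible mapping (one thread per output row) rather than a tuned algorithm. The matrix is assumed to be column-major so that neighbouring threads read neighbouring addresses.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: y = A * x for an m x n column-major matrix A.
// Each thread computes one element of y (one row of A).
__global__ void matVecKernel(const float *A, const float *x, float *y,
                             int m, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;  // guard threads past the last row

    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[row + (size_t)col * m] * x[col];
    y[row] = sum;
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel and returns y. Returns an empty vector if any CUDA call fails.
std::vector<float> matVecCustom(const std::vector<float> &A,
                                const std::vector<float> &x, int m, int n)
{
    std::vector<float> y(m);
    float *d_A = nullptr, *d_x = nullptr, *d_y = nullptr;

    bool ok =
        cudaMalloc((void **)&d_A, sizeof(float) * m * n) == cudaSuccess &&
        cudaMalloc((void **)&d_x, sizeof(float) * n) == cudaSuccess &&
        cudaMalloc((void **)&d_y, sizeof(float) * m) == cudaSuccess &&
        cudaMemcpy(d_A, A.data(), sizeof(float) * m * n,
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_x, x.data(), sizeof(float) * n,
                   cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 128;
        const int blocks = (m + threads - 1) / threads;  // round up
        matVecKernel<<<blocks, threads>>>(d_A, d_x, d_y, m, n);
        // The blocking device-to-host copy also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(y.data(), d_y, sizeof(float) * m,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_A);
    cudaFree(d_x);
    cudaFree(d_y);
    if (!ok) y.clear();
    return y;
}
```

This version favours readability. For large matrices, a library routine or a kernel that uses shared memory and a per-row reduction is likely to be faster.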

Question 2

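Make sure you link against the device version of the cuBLAS library. You can't use the same library that you use when calling cuBLAS from the host. The device library needs a GPU of compute capability 3.5 or higher and relocatable device code (-rdc=true), and the program has to be linked against cublas_device and cudadevrt. Details about using the CUDA device library can be found in the CUDA Toolkit documentation: http://docs.nvidia.com/cuda/cublas/index.html#device-api

Look at the CUDA 5 samples under 7_CUDALibraries/.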

Question 3

Here is a code example for your problem; I think the linked code could help you. Thanks to the GitHub author.

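#include <cublas_v2.h>

// Intended to be launched with a single thread, e.g. <<<1, 1>>>: every thread
// that runs this kernel creates its own handle and issues its own GEMM.
// Requires the device cuBLAS library (CUDA versions before 10 only).
__global__ void invokeDeviceCublasSgemm(cublasStatus_t *returnValue,
                                        int n,
                                        const float *d_alpha,
                                        const float *d_A,
                                        const float *d_B,
                                        const float *d_beta,
                                        float *d_C)
{
    // Create a cuBLAS handle on the device
    cublasHandle_t cnpHandle;
    cublasStatus_t status = cublasCreate(&cnpHandle);

    if (status != CUBLAS_STATUS_SUCCESS) {
        *returnValue = status;  // report the failure back to the host
        return;
    }

    /* Perform operation using cublas:
       C = alpha * A * B + beta * C, all matrices n x n, column-major */
    status = cublasSgemm(cnpHandle,
                         CUBLAS_OP_N, CUBLAS_OP_N,
                         n, n, n,
                         d_alpha,
                         d_A, n,
                         d_B, n,
                         d_beta,
                         d_C, n);
    cublasDestroy(cnpHandle);
    *returnValue = status;  // status of the GEMM call
}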
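On CUDA 10 and later, where the device library is gone, the same GEMM can be issued from host code instead. The following is only a sketch of what such a host-side counterpart might look like; the function name invokeHostCublasSgemm is hypothetical. It keeps the kernel's argument list, so alpha and beta are still assumed to live in device memory, which is why the handle is switched to device pointer mode (the host API defaults to host pointers for scalars).

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical host-side counterpart of invokeDeviceCublasSgemm.
// All pointers, including d_alpha and d_beta, refer to device memory.
// Computes C = alpha * A * B + beta * C for n x n column-major matrices.
cublasStatus_t invokeHostCublasSgemm(int n,
                                     const float *d_alpha,
                                     const float *d_A,
                                     const float *d_B,
                                     const float *d_beta,
                                     float *d_C)
{
    cublasHandle_t handle;
    cublasStatus_t status = cublasCreate(&handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        return status;
    }

    // Tell cuBLAS that the scalar arguments are device pointers.
    status = cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    if (status == CUBLAS_STATUS_SUCCESS) {
        status = cublasSgemm(handle,
                             CUBLAS_OP_N, CUBLAS_OP_N,
                             n, n, n,
                             d_alpha,
                             d_A, n,
                             d_B, n,
                             d_beta,
                             d_C, n);
    }

    cublasDestroy(handle);
    return status;
}
```

In a real application the handle would normally be created once and reused across calls, since creating and destroying it for every GEMM adds noticeable overhead. The sketch creates it locally only to mirror the structure of the kernel.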