No, it is not possible to perform in-place operations like gemm
using CUBLAS (in fact, I am not aware of any parallel BLAS implementation which guarantees such an operation will work).
Having said that, this comment:
.... much time are spend to malloc space and copy data from device to device for these temporary matrices.
makes me think you might be overlooking the obvious. While it is necessary to allocate space for interim matrices, it certainly isn't necessary to perform device to device memory copies when using such allocations. This:
// If A, B & C are pointers to allocations in device memory
// compute C = A*B and copy result to A
multiply(C, A, B);
cudaMemcpy(A, C, sizeA, cudaMemcpyDeviceToDevice);
// now A = A*B
can be replaced by
multiply(C, A, B);
float * tmp = A; A = C; C = tmp;
ie. you only need to exchange pointers on the host to perform the equivalent of a device to device memory copy, but with no GPU time cost. This can't be used in every situation (for example, there are some in-place block operations which might still require an explicit memory transfer), but in most cases an explicit device to device memory transfer can be avoided.
If the memory cost of large dense operations with CUBLAS is limiting your application, consider investigating "out of core" approaches to working with large dense matrices.