cublas: same input and output matrix for better performance?

Question 1

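No, it is not possible to perform in-place operations like gemm using CUBLAS (in fact, I am not aware of any parallel BLAS implementation which guarantees such an operation will work). Every element of the product depends on a full row of A and a full column of B, so overwriting an input while other threads are still reading it cannot be relied on to give the right answer.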
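To see why aliasing the output with an input is a problem, here is a minimal host-only sketch. It is a serial illustration, not CUBLAS: `naive_gemm`, `product_separate` and `product_aliased` are hypothetical names, and the matrices are assumed to be square and row-major. In a parallel library the outcome of such aliasing would be unpredictable rather than this particular wrong answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive serial product out = a * b for n x n row-major matrices.
// It deliberately does not forbid out == a, so the hazard can be observed.
void naive_gemm(float* out, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i * n + k] * b[k * n + j];
            }
            out[i * n + j] = acc;  // if out aliases a, this clobbers an input
        }
    }
}

// Correct usage: the result goes to separate storage.
std::vector<float> product_separate(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n);
    naive_gemm(c.data(), a.data(), b.data(), n);
    return c;
}

// Hazardous usage: the output buffer is the first input.
std::vector<float> product_aliased(std::vector<float> a,
                                   const std::vector<float>& b,
                                   std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    naive_gemm(a.data(), a.data(), b.data(), n);
    return a;
}
```

With A = {1, 2, 3, 4} and B = {0, 1, 1, 0}, the separate-output version returns the true product {2, 1, 4, 3}, while the aliased version returns {2, 2, 4, 4}, because entries of A are overwritten before they have been fully consumed.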

Having said that, this comment:

.... much time are spend to malloc space and copy data from device to device for these temporary matrices.

makes me think you might be overlooking the obvious. While it is necessary to allocate space for interim matrices, it certainly isn't necessary to perform device to device memory copies when using such allocations. This:

// If A, B & C are pointers to allocations in device memory
// compute C = A*B and copy result to A
multiply(C, A, B);
cudaMemcpy(A, C, sizeA, cudaMemcpyDeviceToDevice);
// now A = A*B

can be replaced by

multiply(C, A, B);
float * tmp = A; A = C; C = tmp;

i.e., you only need to exchange pointers on the host to get the effect of a device to device memory copy, but with no GPU time cost. The swap assumes the two allocations are the same size, which holds here because A*B has the same dimensions as A whenever B is square. This can't be used in every situation (for example, there are some in-place block operations which might still require an explicit memory transfer), but in most cases an explicit device to device memory transfer can be avoided.
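The pointer exchange extends naturally to a loop of repeated multiplications with a single scratch allocation. The following is a host-only sketch under stated assumptions: `multiply` here is a hypothetical serial stand-in for whatever GEMM call you use (on the GPU the pointers would refer to device allocations), `repeated_multiply` is an illustrative name, and the matrices are assumed square and row-major.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GEMM call: C = A * B, n x n, row-major.
// As with a real GEMM, the output must not alias either input.
void multiply(float* C, const float* A, const float* B, std::size_t n) {
    assert(C != A && C != B);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

// Applies A = A * B `steps` times using one scratch buffer and no copies
// inside the loop: after each product the two pointers trade roles.
std::vector<float> repeated_multiply(std::vector<float> a,
                                     const std::vector<float>& b,
                                     std::size_t n, int steps) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> scratch(n * n);  // allocated once, reused every step
    float* A = a.data();
    float* C = scratch.data();
    for (int s = 0; s < steps; ++s) {
        multiply(C, A, b.data(), n);
        std::swap(A, C);  // A now names the product, C the old input
    }
    // A may point into either buffer, so read the result through A.
    return std::vector<float>(A, A + n * n);
}
```

One thing to keep in mind with this pattern: after an odd number of swaps the result lives in what was originally the scratch allocation, so any other copy of the original pointer held elsewhere in the program no longer refers to the current data.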

If the memory cost of large dense operations with CUBLAS is limiting your application, consider investigating "out of core" approaches to working with large dense matrices.

Question 2

You could preallocate a buffer matrix and copy the input matrix A into the buffer before the matrix multiplication.

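// buffer is a device allocation of the same size as A, allocated once
cudaMemcpy(buffer, A, sizeA, cudaMemcpyDeviceToDevice);
multiply(A, buffer, B); // A = (old A)*B, with the old A read from buffer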

By reusing the buffer, you avoid allocating a temporary for every multiplication, and the overhead is only one memory copy per multiplication. For n x n matrices the copy moves O(n^2) data while the multiplication performs O(n^3) arithmetic, so when your matrices are large enough the copy takes a very small portion of the total time and can be ignored.
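As a rough host-only sketch of the reusable-buffer idea: the class below owns one scratch matrix for its whole lifetime and pays a single copy per multiplication. `BufferedMultiplier` and `gemm_ref` are hypothetical names introduced for illustration, and `gemm_ref` is a serial stand-in for the real GEMM call. On the GPU the copy would be a device to device `cudaMemcpy` and the buffer a device allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GEMM call: C = A * B, n x n, row-major.
// The output must not alias either input.
void gemm_ref(float* C, const float* A, const float* B, std::size_t n) {
    assert(C != A && C != B);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

// Owns one scratch matrix that is allocated in the constructor and then
// reused by every call, so no allocation happens per multiplication.
class BufferedMultiplier {
public:
    explicit BufferedMultiplier(std::size_t n) : n_(n), buffer_(n * n) {}

    // Overwrites A with A * B: one copy into the buffer, then one product.
    void multiply_in_place(std::vector<float>& A, const std::vector<float>& B) {
        assert(A.size() == n_ * n_ && B.size() == n_ * n_);
        std::copy(A.begin(), A.end(), buffer_.begin());    // the only copy
        gemm_ref(A.data(), buffer_.data(), B.data(), n_);  // A = buffer * B
    }

private:
    std::size_t n_;
    std::vector<float> buffer_;
};
```

Compared with swapping pointers, this keeps the result at the same address as the original A, at the price of one copy per call.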