Question

I see CUBLAS may be an efficient algorithm package for a single large matrix multiplication, addition, etc. But in a common setting, most computations are dependent: each step relies on the result of the previous one.

This causes a problem: because the output matrix has to be different from the input matrices in a CUBLAS routine (the input matrices are const), much time is spent allocating space and copying data from device to device for these temporary matrices.

So is it possible to do something like multiply(A, A, B), where the first argument is the output matrix and the second/third are the input matrices, to avoid the extra memory manipulation time? Or is there a better workaround?

Thanks a lot!


Solution

No, it is not possible to perform in-place operations like gemm using CUBLAS (in fact, I am not aware of any parallel BLAS implementation which guarantees such an operation will work).

Having said that, this comment:

.... much time is spent allocating space and copying data from device to device for these temporary matrices.

makes me think you might be overlooking the obvious. While it is necessary to allocate space for interim matrices, it certainly isn't necessary to perform device to device memory copies when using such allocations. This:

// If A, B & C are pointers to allocations in device memory
// compute C = A*B and copy result to A
multiply(C, A, B);
cudaMemcpy(A, C, sizeA, cudaMemcpyDeviceToDevice);
// now A = A*B

can be replaced by

multiply(C, A, B);
float * tmp = A; A = C; C = tmp;

That is, you only need to exchange pointers on the host to perform the equivalent of a device to device memory copy, with no GPU time cost. This can't be used in every situation (for example, some in-place block operations might still require an explicit memory transfer), but in most cases an explicit device to device memory transfer can be avoided.

If the memory cost of large dense operations with CUBLAS is limiting your application, consider investigating "out of core" approaches to working with large dense matrices.

Other tips

You could pre-allocate a buffer matrix, and copy the input matrix A into the buffer before the mat-mul operation.

Memcopy(buff, A);
Multiply(A, buff, B);

By reusing the buffer, you don't need to allocate it for every call, and the overhead is just one memory copy per mat-mul. When your matrices are large enough, this overhead takes a very small portion of the total time and can be ignored.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow