I see three main issues related to the fact that each thread should perform a whole minimization process (which Matlab minimization routine are you currently using? fminunc? fminsearch? minFunc?).
Minimizations can be demanding in terms of temporary variables. This could limit the performance of your algorithm because of the need to store the temporaries and move them through, for example, global memory, depending of course on how you mean to implement it.
You should also carefully take thread synchronization into account, since the time needed to complete the minimization process can vary from thread to thread.
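As a concrete illustration of both points, here is a minimal hypothetical sketch (all names are mine, and a toy 1D objective stands in for the real one) of one independent minimization per thread; the per-thread temporaries and the data-dependent iteration count are exactly what causes register/local-memory pressure and warp divergence:

// Toy per-thread objective; a stand-in for the real functional (assumption).
__device__ double f(double x, int tid)
{
    return (x - 0.01 * tid) * (x - 0.01 * tid);
}

// One golden-section minimization per thread over [a[tid], b[tid]].
__global__ void perThreadMinimize(const double* a, const double* b,
                                  double* xmin, int N, double tol)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;

    const double phi = 0.6180339887498949;
    double lo = a[tid], hi = b[tid];   // per-thread temporaries

    // The iteration count depends on the per-thread interval width, so
    // threads in a warp idle until the slowest minimization finishes.
    while (hi - lo > tol) {
        double x1 = hi - phi * (hi - lo);
        double x2 = lo + phi * (hi - lo);
        if (f(x1, tid) < f(x2, tid)) hi = x2; else lo = x1;
    }
    xmin[tid] = 0.5 * (lo + hi);
}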
Matlab has very effective optimization routines, whose performance is generally difficult (but, of course, not impossible) to replicate in a custom implementation. In my experience, Matlab's fminunc is more effective than the equivalent Broyden-Fletcher-Goldfarb-Shanno routine provided by the NAG library. So, if you are trying to translate one of the above optimization routines, you could end up with a less satisfactory result.
I have faced many optimization problems using Matlab accelerated with CUDA, and my "golden rule" is to use one of Matlab's optimization routines and accelerate the solution of the direct problem (the calculation of the functional) and of the functional gradient with purposely written CUDA code interfaced with Matlab through mex-files. Take into account that the gradient, especially, needs to (and can) be accelerated, since the finite-difference calculations of the functional derivatives are mutually independent and require as many invocations of the functional computation routine as there are optimization parameters; a sketch of this parallelism is given right below.
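To make that independence explicit, here is a minimal hypothetical sketch (all names are mine; a toy quadratic stands in for the real functional): each thread computes one forward-difference gradient component, i.e., one of the N extra functional evaluations. The unperturbed value f0 is assumed to have been computed beforehand, e.g. with k = -1.

// Toy quadratic standing in for the real functional (assumption); component
// k is perturbed by h on the fly, so no per-thread copy of x is needed.
__device__ double functional(const double* x, int N, int k, double h)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) {
        double xi = (i == k) ? x[i] + h : x[i];
        s += xi * xi;
    }
    return s;
}

// One forward-difference gradient component per thread: the N extra
// functional evaluations are mutually independent.
__global__ void forwardDifferenceGradient(const double* x, double f0,
                                          double h, double* grad, int N)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < N) grad[k] = (functional(x, N, k, h) - f0) / h;
}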
EDIT
Suppose that I have to optimize the objective functional objfun. What I do is code objfun in CUDA with a mex-file interface, compile it with nvcc, and then link it under Matlab.
As I'm using Matlab 2010, the CUDA function is compiled with nvcc and transformed into C++ code by the command
system(sprintf('nvcc -I"%s/extern/include" --cuda "mexfun.cu" --output-file "mexfun.cpp"', matlabroot));
and then linked to Matlab by
mex -I/opt/cuda/include -L/opt/cuda/lib -lcudart mexfun.cpp
as suggested in Compiling CUDA C/C++ mex code under linux.
Then, when using, for example, fminunc(@mexfun,...), Matlab will optimize the objective functional, and each evaluation of it will be performed (and thus accelerated) on the GPU. When it is analytically available, I also code the gradient calculation by the same approach, since the finite differences used to evaluate the gradient can significantly slow down the whole optimization process; a sketch of such a combined value-and-gradient mex file follows.
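For completeness, here is a hedged sketch of what such a value-plus-gradient mex file can look like. This is not my production code: a toy quadratic f(x) = sum(x.^2) with gradient 2x stands in for the real functional, and all names are illustrative. Returning [f, g] from a single mex call lets fminunc be run with optimset('GradObj','on').

#include "mex.h"
#include <cuda_runtime.h>

// Toy analytic gradient and per-component functional values (assumption).
__global__ void quadGrad(const double* x, double* grad, double* fparts, int N)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= N) return;
    grad[k]   = 2.0 * x[k];     // analytic gradient component
    fparts[k] = x[k] * x[k];    // summed on the host below, for simplicity
}

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    int N = (int)mxGetNumberOfElements(prhs[0]);

    double *d_x, *d_grad, *d_fparts;
    cudaMalloc((void**)&d_x,      N * sizeof(double));
    cudaMalloc((void**)&d_grad,   N * sizeof(double));
    cudaMalloc((void**)&d_fparts, N * sizeof(double));
    cudaMemcpy(d_x, mxGetPr(prhs[0]), N * sizeof(double),
               cudaMemcpyHostToDevice);

    quadGrad<<<(N + 255) / 256, 256>>>(d_x, d_grad, d_fparts, N);

    plhs[0] = mxCreateDoubleScalar(0.0);          /* functional value */
    plhs[1] = mxCreateDoubleMatrix(N, 1, mxREAL); /* gradient */
    cudaMemcpy(mxGetPr(plhs[1]), d_grad, N * sizeof(double),
               cudaMemcpyDeviceToHost);

    /* Reduce the partial values on the host; fine for a sketch, although
       a real implementation would reduce on the device. */
    double* fparts = (double*)mxMalloc(N * sizeof(double));
    cudaMemcpy(fparts, d_fparts, N * sizeof(double), cudaMemcpyDeviceToHost);
    double fval = 0.0;
    for (int i = 0; i < N; i++) fval += fparts[i];
    *mxGetPr(plhs[0]) = fval;

    mxFree(fparts);
    cudaFree(d_x); cudaFree(d_grad); cudaFree(d_fparts);
}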
For Matlab 2013 and Windows systems, see Creating mex files from CUDA code.
EDIT
Structure of mexfun.cu (objective function):
#include "mex.h"
#include <cuda_runtime.h>
#include <cuComplex.h>

#define BLOCK_SIZE_X 16   /* example values; tune for your kernel */
#define BLOCK_SIZE_Y 16

// Do not change the function name (`mexFunction`) and the function arguments (`nlhs`, `plhs`, ...).
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
/* Map Matlab's input arrays to host (CPU) pointers */
double* input_1 = mxGetPr(prhs[0]);
double* input_2 = mxGetPr(prhs[1]);
/* Recovers the size of the input matrices */
int dimx = mxGetN(prhs[0]);
...
int dimu = mxGetM(prhs[3]);
/* Memory allocations on the host */
cuDoubleComplex* hfoo = (cuDoubleComplex *)malloc(sizeof(cuDoubleComplex)*dimx);
...
/* Memory allocations on the device */
cuDoubleComplex* dfoo; cudaMalloc((void**)&dfoo, dimx*sizeof(cuDoubleComplex));
...
/* Memory transfer from host to device */
cudaMemcpy(dfoo,hfoo,dimx*sizeof(cuDoubleComplex),cudaMemcpyHostToDevice);
...
/* Kernel launch */
dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
dim3 dimGrid((dimx + BLOCK_SIZE_X - 1) / BLOCK_SIZE_X);   /* dimGrid was previously undefined */
Kernel_To_Be_Launched<<<dimGrid, dimBlock>>>(dfoo, dimx); /* kernels must receive device pointers, not host pointers like hfoo */
/* Copy the results from device to host */
cudaMemcpy(hfoo, dfoo, dimx*sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost);
/* Passing the output matrices to MATLAB */
plhs[0] = mxCreateDoubleMatrix(1,dimu,mxCOMPLEX);
double* hfoo_re = mxGetPr(plhs[0]);
double* hfoo_im = mxGetPi(plhs[0]);
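/* (Elided here: unpack the cuDoubleComplex results in hfoo into hfoo_re / hfoo_im) */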
/* Freeing host memory */
free(hfoo);
...
/* Freeing device memory */
cudaFree(dfoo);
}