Question

I was wondering if I can improve the speed of my simulations with CUDA in Matlab by simply breaking the loop up into GPU threads.

So far I have done the loops with mclapply in R, but I was wondering if I could speed them up in Matlab with CUDA (Nvidia GTX 650 Ti Boost)?

Here is a simplified example of what I'm doing: I have a dataset of dimension 2000x9, and I want to run at least 250 loops over a moving data window (ca. 1500 rows). These loops are independent and thus perfect for parallel computation.

Example of one loop: take the first 1500 rows. Do some magic with the whole window (1500x9). Calculate a univariate function on each column (a 1500x1 operation, done 9 times), then run a minimization problem with a certain loss function for each column (again a 1500x1 operation, done 9 times). In between I'm also working with all columns at once, so I cannot split the columns across different threads.

My idea: what if I ran each loop in a different GPU thread? Does it make sense to run such a computationally hard problem on a single GPU core? (For illustration, one loop takes ca. 300 seconds on a single core of an i7 3770K.)

I know that a single GPU thread is much slower than a CPU core, but what if all the loops run simultaneously? Since each loop takes so long once it has its data, feeding the data to the GPU should take only an insignificant portion of the overall time.

Important note: I'm a very lousy "programmer", and my plan is simply to turn the body of the for loop (i in 1:250) into a function of the variable i and pass that to the GPU.
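Very roughly, I picture it as a kernel like the sketch below, with one thread per window index i. The names one_window_per_thread and window_op are just placeholders for the per-window computation (window_op is defined as a stub only so the example compiles), and a row-major data layout is assumed:

/* Placeholder for the per-window "magic + minimization"; a stub so the sketch compiles. */
__device__ double window_op(const double *window, int rows, int cols)
{
    return 0.0;
}

/* One GPU thread per loop iteration: thread i works on the window starting at row i. */
__global__ void one_window_per_thread(const double *data, double *result,
                                      int n_windows, int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* window index, 0..n_windows-1 */
    if (i < n_windows) {
        /* row-major layout assumed: window i starts at row i of the 2000x9 dataset */
        result[i] = window_op(data + (size_t)i * cols, rows, cols);
    }
}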


Solution

I see three main issues related to the fact that each thread should perform a whole minimization process (which kind of Matlab minimization routine are you currently using? fminunc? fminsearch? minFunc?).

  1. Minimizations can be demanding in terms of temporary variables. This could limit the performance of your algorithm due to the need to store temporaries and exchange them with, for example, global memory, depending of course on how you intend to implement it.

  2. You should also carefully take thread synchronization into account, since the time needed to complete the minimization process can vary from thread to thread.

  3. Matlab has very effective optimization routines, whose performance is generally difficult (but, of course, not impossible) to replicate with a custom implementation. In my experience, Matlab's fminunc is more effective than the equivalent Broyden-Fletcher-Goldfarb-Shanno routine provided by the NAG library. So, if you try to translate one of those optimization routines to the GPU, you could end up with a less satisfactory result.

I have faced many optimization problems using Matlab accelerated with CUDA, and my "golden rule" is to use one of Matlab's optimization routines and to accelerate the solution of the direct problem (the calculation of the functional) and of the functional gradient by purpose-written CUDA code interfaced with Matlab through mex-files. Take into account that the gradient in particular needs to (and can) be accelerated, since the finite-difference calculations of the functional derivatives are independent of each other and require as many functional evaluations as there are optimization parameters.
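Just to fix ideas on the gradient point, here is a minimal sketch (not my actual code) of a finite-difference gradient in which each thread handles one parameter. The names are illustrative: objfun_device is a toy quadratic standing in for the real device-side functional, scratch is a pre-allocated n x n buffer giving each thread its own perturbed copy of x, f0 is the functional value at x, and h is the finite-difference step.

/* Toy device-side objective, standing in for the real functional evaluation. */
__device__ double objfun_device(const double *x, int n)
{
    double s = 0.0;
    for (int k = 0; k < n; ++k) s += x[k] * x[k];
    return s;
}

/* One thread per optimization parameter: thread i builds x + h*e_i in its own
   row of the scratch buffer and computes a forward difference. */
__global__ void fd_gradient(const double *x, double *scratch, double *grad,
                            int n, double f0, double h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double *xi = scratch + (size_t)i * n;          /* private copy of x for thread i */
    for (int k = 0; k < n; ++k) xi[k] = x[k];
    xi[i] += h;                                    /* perturb the i-th parameter */

    grad[i] = (objfun_device(xi, n) - f0) / h;     /* forward difference */
}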

EDIT Suppose that I have to optimize the objective functional objfun. What I do is code objfun in CUDA with a mex-file interface, compile it with nvcc, and then link it under Matlab.

As I'm using Matlab 2010, the CUDA function is compiled by nvcc and transformed into C++ code by the command

system(sprintf('nvcc -I"%s/extern/include" --cuda "mexfun.cu" --output-file "mexfun.cpp"', matlabroot));

and then linked to Matlab by

mex -I/opt/cuda/include -L/opt/cuda/lib -lcudart mexfun.cpp

as suggested in Compiling CUDA C/C++ mex code under linux.

Then, when using, for example, fminunc(@mexfun,...), Matlab will optimize the objective functional, and each evaluation of it will be performed (and thus accelerated) on the GPU. When it is analytically available, I also code the gradient calculation with the same approach, since the finite differences used to evaluate the gradient can significantly slow down the whole optimization process.

For Matlab 2013 and Windows systems, see Creating mex files from CUDA code.

EDIT Structure of mexfun.cu (objective function)

// Do not change the function name (`mexFunction`) and the function arguments (`nlhs`, `plhs`, ...). 
void mexFunction(int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[])

{
    /* Pointers to the Matlab input variables (host memory) */
    double* input_1     = mxGetPr(prhs[0]);
    double* input_2     = mxGetPr(prhs[1]);

    /* Recovers the size of the input matrices */
    int dimx = mxGetN(prhs[0]);
    ...         
    int dimu = mxGetM(prhs[3]);         

    /* Memory allocations on the host */
    cuDoubleComplex* hfoo = (cuDoubleComplex *)malloc(sizeof(cuDoubleComplex)*dimx);
    ...

   /* Memory allocations on the device */
   cuDoubleComplex* dfoo; cudaMalloc((void**)&dfoo,dimx*sizeof(cuDoubleComplex));
   ...

  /* Memory transfer from host to device */
  cudaMemcpy(dfoo,hfoo,dimx*sizeof(cuDoubleComplex),cudaMemcpyHostToDevice);
  ....

  /* Kernel launch */
  dim3 dimBlock(BLOCK_SIZE_X,BLOCK_SIZE_Y);
  dim3 dimGrid((dimx + dimBlock.x*dimBlock.y - 1)/(dimBlock.x*dimBlock.y));
  Kernel_To_Be_Launched <<<dimGrid,dimBlock>>> (dfoo,dimx);

 /* Copy the results from device to host */
 cudaMemcpy(hfoo,dfoo,dimx*sizeof(cuDoubleComplex),cudaMemcpyDeviceToHost);


 /* Passing the output matrices to MATLAB */
 plhs[0] = mxCreateDoubleMatrix(1,dimu,mxCOMPLEX);
 double* hfoo_re = mxGetPr(plhs[0]);
 double* hfoo_im = mxGetPi(plhs[0]);
 /* ... copy the real and imaginary parts of hfoo into hfoo_re and hfoo_im ... */

 /* Freeing host memory */
 free(hfoo);
 ...

 /* Freeing device memory */
 cudaFree(dfoo);

}
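The kernel itself is not shown above. Just as an illustration of its shape (the scaling operation is a placeholder; the real functional evaluation would go in its place), it could look like this:

#include <cuComplex.h>

/* Toy kernel with the same signature used in the launch above: it merely scales
   each complex element; the actual computation would replace this operation. */
__global__ void Kernel_To_Be_Launched(cuDoubleComplex *data, int dimx)
{
    /* flatten the (possibly 2D) thread block into a single linear index */
    int i = blockIdx.x * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
    if (i < dimx) {
        data[i] = cuCmul(data[i], make_cuDoubleComplex(2.0, 0.0));
    }
}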

OTHER TIPS

I would not consider myself an expert in CUDA (at all), but I have been using it extensively for a while now. My guess is that while you may indeed get some speed-up, it is hard to tell how much without the detailed knowledge of the problem that only you have, and probably not without some effort. Which is to say, you probably can't just "throw it over the wall," so to speak, and hope that the CUDA compiler will catch all the pieces.

My immediate concerns would be related to memory management and bus traffic, as CUDA has very strict rules for memory usage. While the compiler will generally keep things chugging along as best it can, performance will degrade if you use the memory and bus inefficiently.

Specifically, in order to get good performance, you want to load pieces of your problem into the shared memory of the various streaming multiprocessors (SMs). The shared memory available to an SM on modern cards is only 48 KB. You describe your problem in chunks of 1500 x 9 values (floats, I assume), which at roughly 54 KB is already more than 48 KB. Moreover, the shared memory of an SM is shared by all the threads resident on that SM; if your problem takes up all 48 KB of an SM, then most of that SM will sit idle.

So that sounds bad. But if there is a way that you can compute the answer to those 1500 x 9 chunks in smaller pieces and recombine them, then you may have a candidate for a GPU approach. Some creativity is often required.
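To make the "smaller pieces and recombine" idea concrete, here is a minimal sketch (my guess at one possible shape, not a drop-in solution): each block loads a tile of one column into shared memory, reduces it, and writes a partial result to be recombined afterwards on the host or in a second kernel. The names column_tile_sum, TILE, and partial are illustrative, and the sum is only a stand-in for whatever per-column computation you actually need.

#define TILE 256   /* tile length; 256 floats = 1 KB, well under the 48 KB limit */

__global__ void column_tile_sum(const float *col, float *partial, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    /* load one tile of the column into shared memory (zero-pad past the end) */
    tile[threadIdx.x] = (i < n) ? col[i] : 0.0f;
    __syncthreads();

    /* in-block tree reduction */
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    /* one partial result per tile, recombined in a second step */
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

This assumes the kernel is launched with TILE threads per block and one block per tile.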

But I emphasize, this is only one concern. It is the one that jumped out at me, as being similar to the issue I'm wrestling with for another application.

JackOLantern raises others, and there are also read/write access patterns to consider, etc.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow