Fast solution of dense linear system of fixed dimension (N=9), symmetric, positive-semidefinite

Question 1

The matrix being symmetric, positive-semidefinite, the Cholesky decomposition is strictly superior to the LU decomposition. (roughly twice faster than LU, whatever the size of the matrix. Source : "Numerical Linear Algebra" by Trefethen and Bau)

It is also de facto the standard for small dense matrices (source : I do a PhD in computational mathematics) Iterative methods are less efficient than direct methods unless the system becomes large enough (quick rule of thumb that means nothing, but that is always nice to have : on any modern computer, any matrix smaller than 100*100 is definitely a small matrix that needs direct methods, rather than iterative ones)

Now, I do not recommend to do it yourself. There are tons of good libraries out there that have been thoroughly tested. But if I have to recommend you one, it would be Eigen :

No installation required (header only library, so no library to link, only #include<>)
Robust and efficient (they have a lot of benchmarks on the main page, and the results are nice)
Easy to use and well documented

By the way, here in the documentation, you have the various pros and cons of their 7 direct linear solvers in a nice, concise table. It seems that in your case, LDLT (a variation of Cholesky) wins

Question 2

Generally, one is best off using an existing library, rather than a roll-your-own approach, as there are many tedious details to attend to in pursuit of a fast, stable numerical implementation.

Here's a few to get you started:

Eigen library (my personal preference):
http://eigen.tuxfamily.org/dox/QuickRefPage.html#QuickRef_Headers

Armadillo: http://arma.sourceforge.net/

Search around and you'll find plenty of others.

Question 3

I would recommend LU decomposition, especially if "solved millions of times" really means "solved once and applied to millions of vectors". You'll create the LU decomposition, save it, and apply forward-back substitution against as many r.h.s. vectors as you wish.

It's more stable in the face of roundoff if you use pivoting.

Question 4

LU for a symmetric semi-definite matrix does not make much sense: you destroy a nice property of your input data performing unnecessary operations.

Choice between LLT or LDLT really depends on the condition number of your matrices, and how you intend to treat edge cases. LDLT should be used only if you can prove a statistically significant improve in accuracy, or if robustness is of paramount importance to your app.

(Without a sample of your matrices it is hard to give sound advice, but I suspect that with such a small order N=9, pivoting the small diagonal terms toward the bottom part of D is really not necessary. So I would start with classical Cholesky and simply abort factorization if the diag terms become to small with respect to some sensibly chosen tolerance.)

Cholesky is pretty simple to code, and if you strive for a really fast code, it is better to implement it yourself.

Question 5

Like others above, I recommend cholesky. I've found that the increased number of additions, subtractions and memory accesses means that LDLt is slower than cholesky.

There are in fact a number a number of variations on cholesky, and which one will be fastest depends on the representation you choose for your matrices. I generally use a fortran style representation, that is a matrix M is a double* M with M(i,j) being m[i+dim*j]; for this I reckon that an upper triangular cholesky is (a little) the fastest, that is one seeks upper triangular U with U'*U = M.

For fixed, small, dimension it is definitely worth considering writing a version that uses no loops. A relatively straightforward way to do this is to write a program to do it. As I recall, using a routine that deals with the general case as a template, it only took a morning to write a program that would write a specific fixed dimension version. The savings can be considerable. For example my general version takes 0.47 seconds to do a million 9x9 factorisations, while the loopless version takes 0.17 seconds -- these timings running single threaded on a 2.6GHz pc.

To show that this is not a major task, I've included the source of such a program below. It includes the general version of the factorisation as a comment. I've used this code in circumstances where the matrices are not close to singular, and I reckon it works ok there; however it may well be too crude for more delicate work.

/*  ----------------------------------------------------------------
**  to write fixed dimension ut cholesky routines
**  ----------------------------------------------------------------
*/
#include    <stdio.h>
#include    <stdlib.h>
#include    <math.h>
#include    <string.h>
#include    <strings.h>
/*  ----------------------------------------------------------------
*/
#if 0
static  inline  double  vec_dot_1_1( int dim, const double* x, const double* y)
{   
double  d = 0.0;
    while( --dim >= 0)
    {   d += *x++ * *y++;
    }
    return d;
}

   /*   ----------------------------------------------------------------
    **  ut cholesky: solve U'*U = P for ut U in P (only ut of P accessed)
    **  ----------------------------------------------------------------
    */   

int mat_ut_cholesky( int dim, double* P)
{
int i, j;
double  d;
double* Ucoli;

    for( Ucoli=P, i=0; i<dim; ++i, Ucoli+=dim)
    {   /* U[i,i] = P[i,i] - Sum{ k<i | U[k,i]*U[k,i]} */
        d = Ucoli[i] - vec_dot_1_1( i, Ucoli, Ucoli);
        if ( d < 0.0)
        {   return 0;
        }
        Ucoli[i] = sqrt( d);
        d = 1.0/Ucoli[i];
        for( j=i+1; j<dim; ++j)
        {   /* U[i,j] = (P[i,j] - Sum{ k<i | U[k,i]*U[k,j]})/U[i,i] */
            P[i+j*dim] = d*(P[i+j*dim] - vec_dot_1_1( i, Ucoli, P+j*dim));
        }
    }
    return 1;
}
/*  ----------------------------------------------------------------
*/
#endif

/*  ----------------------------------------------------------------
**
**  ----------------------------------------------------------------
*/
static  void    write_ut_inner_step( int dim, int i, int off)
{
int j, k, l;
    printf( "\td = 1.0/P[%d];\n", i+off);
    for( j=i+1; j<dim; ++j)
    {   k = i+j*dim;
        printf( "\tP[%d] = d * ", k);
        if ( i)
        {   printf( "(P[%d]", k);
            printf( " - (P[%d]*P[%d]", off, j*dim);
            for( l=1; l<i; ++l)
            {   printf( " + P[%d]*P[%d]", l+off, l+j*dim);
            }
            printf( "));");
        }
        else
        {   printf( "P[%d];", k);
        }
        printf( "\n");
    }
}

static  void    write_dot( int n, int off)
{   
int i;
    printf( "P[%d]*P[%d]", off, off);
    for( i=1; i<n; ++i)
    {   printf( "+P[%d]*P[%d]", off+i, off+i);
    }
}

static  void    write_ut_outer_step( int dim, int i, int off)
{
    printf( "\td = P[%d]", off+i);
    if ( i)
    {   printf( " - (");
        write_dot( i, off);
        printf( ")");
    }
    printf( ";\n");

    printf( "\tif ( d <= 0.0)\n");
    printf( "\t{\treturn 0;\n");
    printf( "\t}\n");

    printf( "\tP[%d] = sqrt( d);\n", i+off);
    if ( i < dim-1)
    {   write_ut_inner_step( dim, i, off);  
    }
}

static  void    write_ut_chol( int dim)
{
int i;
int off=0;
    printf( "int\tut_chol_%.2d( double* P)\n", dim);
    printf( "{\n");
    printf( "double\td;\n");
    for( i=0; i<dim; ++i)
    {   write_ut_outer_step( dim, i, off);
        printf( "\n");
        off += dim;
    }
    printf( "\treturn 1;\n");
    printf( "}\n");
}
/*  ----------------------------------------------------------------
*/


/*  ----------------------------------------------------------------
**
**  ----------------------------------------------------------------
*/
static  int read_args( int* dim, int argc, char** argv)
{
    while( argc)
    {   if ( strcmp( *argv, "-h") == 0)
        {   return 0;
        }
        else if (strcmp( *argv, "-d") == 0)
        {   --argc; ++argv;
            *dim = atoi( (--argc, *argv++));
        }
        else
        {   break;
        }
    }
    return 1;
}

int main( int argc, char** argv)
{
int dim = 9;
    if( read_args( &dim, --argc, ++argv))
    {   write_ut_chol( dim);
    }
    else
    {   fprintf( stderr, "usage: wchol (-d dim)? -- writes to stdout\n");
    }
    return EXIT_SUCCESS;
}
/*  ----------------------------------------------------------------
*/

Question 6

Because of its ease of use, you can take Eigen solvers just for comparison. For specific use case a specific solver might be faster although another is supposed to be better. For that, you can measure runtimes for the each algorithm just for the selection. After that you can implement the desired option (or find an existing one that fits your needs the best).