Question

While trying to compute eigenvalues and eigenvectors of several matrices in parallel, I found that LAPACK's dsyevr function does not seem to be thread safe.

  • Is this known to anyone?
  • Is there something wrong with my code? (see minimal example below)
  • Any suggestion for a dense-matrix eigensolver implementation that is not too slow and is definitely thread safe is welcome.

Here is a minimal code example in C which demonstrates the problem:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>
#include <omp.h>
#include "lapacke.h"

#define M 8 /* number of matrices to be diagonalized */
#define N 1000 /* size of each matrix (real, symmetric) */

typedef double vec_t[N]; /* type for length N vector */
typedef double mtx_t[N][N]; /* type for N x N matrices */

void 
init(int m, int n, mtx_t *A){
    /* init m symmetric n x n matrices */
    srand(0);
    for (int i = 0; i < m; ++i){
        for (int j = 0; j < n; ++j){
            for (int k = 0; k <= j; ++k){
                A[i][j][k] = A[i][k][j] = (rand()%100-50) / (double)100.;
            }
        }
    }
}

void 
solve(int n, double *A, double *E, double *Q){
    /* diagonalize one matrix */
    double tol = 0.;
    int *isuppz = malloc(2*n*sizeof(int)); assert(isuppz);
    int k;
    /* jobz='V': compute eigenvectors, range='A': all eigenvalues,
       uplo='L': use the lower triangle; tol = 0. lets LAPACK pick the tolerance */
    int info = LAPACKE_dsyevr(LAPACK_COL_MAJOR, 'V', 'A', 'L', 
                              n, A, n, 0., 0., 0, 0, tol, &k, E, Q, n, isuppz);
    assert(!info);
    free(isuppz);
}

void 
s_solve(int m, int n, mtx_t *A, vec_t *E, mtx_t *Q){
    /* serial solve */
    for (int i = 0; i < m; ++i){
        solve(n, (double *)A[i], (double *)E[i], (double *)Q[i]);
    }
}

void 
p_solve(int m, int n, mtx_t *A, vec_t *E, mtx_t *Q, int nt){
    /* parallel solve */
    int i;
    #pragma omp parallel for schedule(static) num_threads(nt) \
        private(i) \
        shared(m, n, A, E, Q)
    for (i = 0; i < m; ++i){
        solve(n, (double *)A[i], (double *)E[i], (double *)Q[i]);
    }
}

void 
analyze_results(int m, int n, vec_t *E0, vec_t *E1, mtx_t *Q0, mtx_t *Q1){
    /* compare eigenvalues */
    printf("\nmax. abs. diff. of eigenvalues:\n");
    for (int i = 0; i < m; ++i){
        double t, dE = 0.;
        for (int j = 0; j < n; ++j){
            t = fabs(E0[i][j] - E1[i][j]);
            if (t > dE) dE = t;
        }
        printf("%i: %5.1e\n", i, dE);
    }

    /* compare eigenvectors (ignoring sign) */
    printf("\nmax. abs. diff. of eigenvectors (ignoring sign):\n");
    for (int i = 0; i < m; ++i){
        double t, dQ = 0.;
        for (int j = 0; j < n; ++j){
            for (int k = 0; k < n; ++k){
                t = fabs(fabs(Q0[i][j][k]) - fabs(Q1[i][j][k]));
                if (t > dQ) dQ = t;
            }
        }
        printf("%i: %5.1e\n", i, dQ);
    }
}


int main(void){
    mtx_t *A = malloc(M*N*N*sizeof(double)); assert(A);
    init(M, N, A);

    /* allocate space for matrices, eigenvalues and eigenvectors */
    mtx_t *s_A = malloc(M*N*N*sizeof(double)); assert(s_A);
    vec_t *s_E = malloc(M*N*sizeof(double));   assert(s_E);
    mtx_t *s_Q = malloc(M*N*N*sizeof(double)); assert(s_Q);

    /* copy initial matrix */
    memcpy(s_A, A, M*N*N*sizeof(double));

    /* solve serial */
    s_solve(M, N, s_A, s_E, s_Q);

    /* allocate space for matrices, eigenvalues and eigenvectors */
    mtx_t *p_A = malloc(M*N*N*sizeof(double)); assert(p_A);
    vec_t *p_E = malloc(M*N*sizeof(double));   assert(p_E);
    mtx_t *p_Q = malloc(M*N*N*sizeof(double)); assert(p_Q);

    /* copy initial matrix */
    memcpy(p_A, A, M*N*N*sizeof(double));

    /* use one thread, to check that the algorithm is deterministic */
    p_solve(M, N, p_A, p_E, p_Q, 1); 

    analyze_results(M, N, s_E, p_E, s_Q, p_Q);

    /* copy initial matrix */
    memcpy(p_A, A, M*N*N*sizeof(double));

    /* use several threads, and see what happens */
    p_solve(M, N, p_A, p_E, p_Q, 4); 

    analyze_results(M, N, s_E, p_E, s_Q, p_Q);

    free(A);
    free(s_A);
    free(s_E);
    free(s_Q);
    free(p_A);
    free(p_E);
    free(p_Q);
    return 0;
}

and this is what you get (note the difference in the last output block, which shows that the eigenvectors are wrong although the eigenvalues are ok):

max. abs. diff. of eigenvalues:
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00

max. abs. diff. of eigenvectors (ignoring sign):
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00

max. abs. diff. of eigenvalues:
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00

max. abs. diff. of eigenvectors (ignoring sign):
0: 0.0e+00
1: 1.2e-01
2: 1.6e-01
3: 1.4e-01
4: 2.3e-01
5: 1.8e-01
6: 2.6e-01
7: 2.6e-01

The code was compiled with gcc 4.4.5 and linked against OpenBLAS (which contains LAPACK, libopenblas_sandybridge-r0.2.8.so), but it was also tested with another LAPACK version. Calling LAPACK directly from C (without LAPACKE) was tested as well, with the same results. Substituting the dsyevd function for dsyevr (and adjusting the arguments) also had no effect.

Finally, here is the compilation instruction I used:

gcc -std=c99 -fopenmp -L/path/to/openblas/lib -Wl,-R/path/to/openblas/lib/ \
-lopenblas -lgomp -I/path/to/openblas/include main.c -o main

Unfortunately, Google did not answer my questions, so any hint is welcome!

EDIT: To make sure that everything is OK with the BLAS and LAPACK versions, I took the reference LAPACK (including BLAS and LAPACKE) from http://www.netlib.org/lapack/ (version 3.4.2). Compiling the example code was a bit tricky, but it finally worked with separate compilation and linking:

gcc -c -std=c99 -fopenmp -I../lapack-3.4.2/lapacke/include \
    netlib_dsyevr.c -o netlib_main.o
gfortran netlib_main.o ../lapack-3.4.2/liblapacke.a \
    ../lapack-3.4.2/liblapack.a ../lapack-3.4.2/librefblas.a \
    -lgomp -o netlib_main

The netlib LAPACK/BLAS and the example program were built on a Darwin 12.4.0 x86_64 and a Linux 3.2.0-0.bpo.4-amd64 x86_64 platform; the program misbehaves consistently on both.


Solution

I finally received the explanation from the LAPACK team, which I would like to quote (with permission):

I think the problem you are seeing may be caused by how the FORTRAN version of the LAPACK library you are using was compiled. Using gfortran 4.8.0 (on Mac OS 10.8), I could reproduce the problem you saw if I compile LAPACK with the -O3 option for gfortran. If I recompile the LAPACK and reference BLAS library with -fopenmp -O3, the problem goes away. There is a note in the gfortran manual stating "-fopenmp implies -frecursive, i.e., all local arrays will be allocated on the stack," so there may be local arrays used in some auxiliary routines called by dsyevr for which the default setting of the compiler is causing them to be allocated in a non-thread safe manner. In any case, allocating these on the stack, which -fopenmp seems to do, will address this issue.

I can confirm that this solves the problem for netlib BLAS/LAPACK. One should keep in mind that the stack size is limited and may have to be increased if the matrices get large and/or numerous.
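
If stack space does become the limiting factor, the limits can usually be raised from the shell before starting the program; the values below are arbitrary examples, not recommendations:

ulimit -s unlimited          # stack of the main thread (bash/zsh built-in)
export OMP_STACKSIZE=512M    # stacks of the OpenMP worker threads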

OpenBLAS must be compiled with USE_OPENMP=1 and USE_THREAD=1 to get a single threaded and thread-safe library.
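
For orientation, a build along these lines should yield correctly compiled libraries; the make.inc variable names are the ones from netlib's make.inc.example, and the optimization levels are only examples:

# netlib LAPACK/BLAS: put -fopenmp into the Fortran flags in make.inc
OPTS  = -O3 -fopenmp
NOOPT = -O0 -fopenmp

# OpenBLAS
make USE_OPENMP=1 USE_THREAD=1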

With these compiler and make flags, the sample program runs correctly with all libraries. It remains an open question how to make sure that the user to whom one eventually hands one's code links against a correctly compiled BLAS/LAPACK library. If the user simply got a segmentation fault, one could add a note in a README file, but since the error is more subtle, it is not even guaranteed that the user notices it (users don't read the README file by default ;-) ). Shipping a BLAS/LAPACK with one's code is not a good idea either, since the basic idea of BLAS/LAPACK is that everyone has a version specifically optimized for their machine. Ideas are welcome...
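
One pragmatic (if imperfect) idea is to run a small serial-versus-parallel self-test at program start and warn the user if the eigenvectors disagree. The sketch below reuses solve() and the headers from the question's example; the matrix count and size are arbitrary, and a passing test is of course no guarantee of thread safety:

/* Sketch of a runtime self-test (assumes solve() and the headers from the
   example in the question). Returns 1 if serial and parallel eigenvectors
   agree (ignoring sign), 0 otherwise. */
int
thread_safety_selftest(void)
{
    enum { m = 4, n = 64 };                      /* arbitrary small test sizes */
    size_t sz = (size_t)m*n*n*sizeof(double);
    double *A  = malloc(sz), *As = malloc(sz), *Ap = malloc(sz);
    double *Qs = malloc(sz), *Qp = malloc(sz);
    double *Es = malloc((size_t)m*n*sizeof(double));
    double *Ep = malloc((size_t)m*n*sizeof(double));
    assert(A && As && Ap && Qs && Qp && Es && Ep);

    srand(1);                                    /* deterministic test matrices */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k <= j; ++k)
                A[i*n*n + j*n + k] = A[i*n*n + k*n + j] = (rand()%100 - 50)/100.;
    memcpy(As, A, sz);
    memcpy(Ap, A, sz);

    for (int i = 0; i < m; ++i)                  /* serial reference */
        solve(n, As + i*n*n, Es + i*n, Qs + i*n*n);
    #pragma omp parallel for schedule(static)    /* concurrent calls */
    for (int i = 0; i < m; ++i)
        solve(n, Ap + i*n*n, Ep + i*n, Qp + i*n*n);

    double d = 0.;                               /* compare eigenvectors, ignoring sign */
    for (size_t i = 0; i < (size_t)m*n*n; ++i){
        double t = fabs(fabs(Qs[i]) - fabs(Qp[i]));
        if (t > d) d = t;
    }
    free(A); free(As); free(Ap); free(Qs); free(Qp); free(Es); free(Ep);
    return d < 1e-10;
}

Since this is the same comparison the example program performs, it should at least catch the kind of corruption shown above.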

OTHER TIPS

Regarding another library: GSL. It is thread safe, but that means you have to create a workspace for each thread and make sure each thread uses its own workspace, e.g., by indexing the workspace pointers by thread number; a sketch follows below.
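
A minimal sketch of that pattern, assuming the matrices are laid out as in the question (one n x n block of doubles per matrix) and using the standard GSL symmetric eigensolver API:

#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_eigen.h>

/* Diagonalize m symmetric n x n matrices in parallel with GSL.
   A, E, Q point at m consecutive n*n (resp. n) blocks of doubles. */
void
p_solve_gsl(int m, int n, double *A, double *E, double *Q, int nt)
{
    #pragma omp parallel num_threads(nt)
    {
        /* per-thread workspace: never shared between threads */
        gsl_eigen_symmv_workspace *w = gsl_eigen_symmv_alloc(n);
        #pragma omp for schedule(static)
        for (int i = 0; i < m; ++i){
            gsl_matrix_view a = gsl_matrix_view_array(A + (size_t)i*n*n, n, n);
            gsl_matrix_view q = gsl_matrix_view_array(Q + (size_t)i*n*n, n, n);
            gsl_vector_view e = gsl_vector_view_array(E + (size_t)i*n, n);
            gsl_eigen_symmv(&a.matrix, &e.vector, &q.matrix, w); /* A[i] is destroyed */
            /* sort ascending to match dsyevr; eigenvectors end up in the columns of Q[i] */
            gsl_eigen_symmv_sort(&e.vector, &q.matrix, GSL_EIGEN_SORT_VAL_ASC);
        }
        gsl_eigen_symmv_free(w);
    }
}

Each thread allocates its workspace once and reuses it for all matrices it processes; compile with -fopenmp and link with -lgsl -lgslcblas -lm.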

[the following answer was added before the correct solution was known]

Disclaimer: The following is a workaround, neither does it solve the original problem, nor does it explain what goes wrong with LAPACK. It may, however, help people facing the same problem.


The old f2c'ed version of LAPACK, called CLAPACK, does not seem to have this thread-safety problem. Note that it is not a C interface to the Fortran library but a version of LAPACK that has been automatically translated to C.

Compiling the example and linking it against the latest version of CLAPACK (3.2.1) worked, so CLAPACK does seem to be thread safe in this respect. Of course, the performance of CLAPACK does not match that of netlib BLAS/LAPACK, let alone OpenBLAS/LAPACK, but at least it is not as bad as that of GSL.
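
For illustration, a version of the question's solve() that calls CLAPACK's dsyevr_ directly (with the usual two-step workspace query) might look roughly like this; the f2c types integer/doublereal come from CLAPACK's f2c.h and the prototype from clapack.h:

#include <stdlib.h>
#include <assert.h>
#include "f2c.h"
#include "clapack.h"

void
solve_clapack(int n, double *A, double *E, double *Q)
{
    char jobz = 'V', range = 'A', uplo = 'L';
    integer nn = n, lda = n, ldz = n, il = 0, iu = 0, m, info;
    doublereal vl = 0., vu = 0., abstol = 0.;
    integer *isuppz = malloc(2*n*sizeof(integer)); assert(isuppz);

    /* first call: workspace query (lwork = liwork = -1) */
    doublereal wkopt;
    integer iwkopt, lwork = -1, liwork = -1;
    dsyevr_(&jobz, &range, &uplo, &nn, A, &lda, &vl, &vu, &il, &iu, &abstol,
            &m, E, Q, &ldz, isuppz, &wkopt, &lwork, &iwkopt, &liwork, &info);
    assert(!info);

    /* second call: the actual computation with the suggested workspace sizes */
    lwork = (integer)wkopt; liwork = iwkopt;
    doublereal *work = malloc(lwork*sizeof(doublereal)); assert(work);
    integer *iwork = malloc(liwork*sizeof(integer));     assert(iwork);
    dsyevr_(&jobz, &range, &uplo, &nn, A, &lda, &vl, &vu, &il, &iu, &abstol,
            &m, E, Q, &ldz, isuppz, work, &lwork, iwork, &liwork, &info);
    assert(!info);

    free(work); free(iwork); free(isuppz);
}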

Here are some timings for all tested LAPACK variants (and GSL) for the diagonalization of one 1000 x 1000 matrix (on one thread of course) initialized with the init function (see question for definition).

time -p ./gsl
real 17.45
user 17.42
sys 0.01

time -p ./netlib_dsyevr
real 0.67
user 0.84
sys 0.02

time -p ./openblas_dsyevr
real 0.66
user 0.46
sys 0.01

time -p ./clapack_dsyevr
real 1.47
user 1.46
sys 0.00

This indicates that GSL is definitely not a good workaround for large matrices with dimensions of a few thousand, especially if you have many of them.

It seems you prompted the LAPACK developers to introduce a "fix": they added -frecursive to the compiler flags in make.inc.example. From testing your example code, the fix seems irrelevant (the numerical errors do not go away) and undesirable (a possible performance hit).

Even if the fix were correct, -frecursive is implied by -fopenmp, so people using consistent flags are on the safe side (those using prepackaged software are not).

To conclude, please fix your code rather than confuse people.

Licensed under: CC-BY-SA with attribution