Armadillo+OpenBLAS slower than MATLAB?

Question 1

Key observation is that Armadillo exp() function is way slower than MATLAB. Similar overhead is observed in log(), pow() and sqrt().

Question 2

First, the bottleneck is not exp(), though std::exp is slow. The problem is randn().

on my machine, randn() takes most of the time. And when I use MKL_VSL 's implementation of randn, the time cost dropped from 12s to 4s, comparable to matlab's 3s or so.

#include <iostream>
#include <armadillo>
#include <ctime>
#include "mkl_vml.h"
#include "mkl_vsl.h"
using namespace std;
using namespace arma;

#define SEED 0
#define BRNG VSL_BRNG_MCG31
#define METHOD 0
int main()
{
clock_t start;
VSLStreamStatePtr stream;
start = clock();
vslNewStream(&stream, BRNG, SEED);
unsigned int R=100000;
vec Spre = 100*ones<vec> (R);
vec S = zeros<vec> (R);
double r = 0.03;
double Vol = 0.2;
double TTM = 5;
unsigned int T=260*TTM;
double dt = TTM/T;
double tmp = sqrt(dt);
vec tmp2=100*zeros<vec>(R);
vec tmp3=100*zeros<vec>(R);
for (unsigned int iT=0; iT<T; ++iT)
{
    vdRngGaussian(METHOD,stream, R, tmp3.memptr(), 0, 1);
    tmp2 =(r - 0.5 * Vol * Vol) * dt + Vol * tmp * tmp3;
    vdExp(R, tmp2.memptr(), tmp3.memptr());
    S = Spre%tmp3;
    Spre = S;
}
cout << mean(S) << endl;
cout << (clock()-start) / (double) CLOCKS_PER_SEC << endl;
vslDeleteStream(&stream);
//system("pause");
return 0;
}

Question 3

Just a guess, but it looks like you need to set the number of threads to use in OpenBLAS via the OPENBLAS_NUM_THREADS environment variable.

Try something like:

set OPENBLAS_NUM_THREADS=4

...on the command line before you run your program. Substitute the number of cores in your system where I put "4" (some would say set it to twice the number of cores in your system--YMMV).

Question 4

Make sure you have Streaming SIMD Extensions enabled when you compile your code. In Visual Studio, check your project C/C++ compiler code generation options.