Question

New to SO. I am test-driving Armadillo + OpenBLAS, and a simple Monte Carlo simulation of geometric Brownian motion runs much slower than the same logic in MATLAB. I believe something must be wrong.

Environment: Intel i5 (4 cores), 8 GB RAM, VS 2012 Express, Armadillo 4.2, OpenBLAS (official x64 binary) v0.2.9.rc2.

MATLAB takes 2 seconds for the same logic, but Armadillo + OpenBLAS takes 12 seconds. I also noticed that the program runs on a single thread, even though I chose OpenBLAS because I had heard of its multi-core capability.

Thanks for any advice.

#include <iostream>
#include <armadillo>
#include <ctime>

using namespace std;
using namespace arma;

int main()
{
clock_t start;
start = clock();
unsigned int R=100000;
vec Spre = 100*ones<vec> (R);
vec S = zeros<vec> (R);
double r = 0.03;
double Vol = 0.2;
double TTM = 5;
unsigned int T=260*TTM;
double dt = TTM/T;
for (unsigned int iT=0; iT<T; ++iT)
{
    S = Spre%exp((r-0.5*Vol*Vol)*dt + Vol*sqrt(dt)*randn(R));
    Spre = S;
}
cout << mean(S) << endl;
cout << (clock()-start) / (double) CLOCKS_PER_SEC << endl;
system("pause");
return 0;
}
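
For reference, the loop above applies the exact geometric Brownian motion step, so the printed mean can be checked against a closed form. This is a sketch of the standard result, writing σ for Vol, Δt for dt and S_0 = 100 (symbols not used in the original post):

```latex
S_{t+\Delta t} = S_t \exp\!\left( \left(r - \tfrac{1}{2}\sigma^2\right)\Delta t + \sigma\sqrt{\Delta t}\, Z \right),
\qquad Z \sim \mathcal{N}(0,1)

\mathbb{E}\!\left[S_{\mathrm{TTM}}\right] = S_0\, e^{r \cdot \mathrm{TTM}} = 100\, e^{0.03 \cdot 5} \approx 116.18
```

Any implementation of the same logic (MATLAB, Armadillo, or MKL-based) should print a mean close to this value, up to Monte Carlo noise.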

Solution 2

The key observation is that Armadillo's exp() function is much slower than MATLAB's. Similar overhead is observed in log(), pow() and sqrt().

OTHER TIPS

First, the bottleneck is not exp(), although std::exp is slow. The problem is randn().

On my machine, randn() takes most of the time. When I use the MKL VSL Gaussian generator instead of randn(), the running time drops from 12 s to 4 s, comparable to MATLAB's 3 s or so.

#include <iostream>
#include <armadillo>
#include <ctime>
#include "mkl_vml.h"
#include "mkl_vsl.h"
using namespace std;
using namespace arma;

#define SEED 0
#define BRNG VSL_BRNG_MCG31
#define METHOD 0
int main()
{
clock_t start;
VSLStreamStatePtr stream;
start = clock();
vslNewStream(&stream, BRNG, SEED);
unsigned int R=100000;
vec Spre = 100*ones<vec> (R);
vec S = zeros<vec> (R);
double r = 0.03;
double Vol = 0.2;
double TTM = 5;
unsigned int T=260*TTM;
double dt = TTM/T;
double tmp = sqrt(dt);
vec tmp2 = zeros<vec>(R); // exponent for each path
vec tmp3 = zeros<vec>(R); // Gaussian draws, then overwritten with exp(tmp2)
for (unsigned int iT=0; iT<T; ++iT)
{
    vdRngGaussian(METHOD,stream, R, tmp3.memptr(), 0, 1);
    tmp2 =(r - 0.5 * Vol * Vol) * dt + Vol * tmp * tmp3;
    vdExp(R, tmp2.memptr(), tmp3.memptr());
    S = Spre%tmp3;
    Spre = S;
}
cout << mean(S) << endl;
cout << (clock()-start) / (double) CLOCKS_PER_SEC << endl;
vslDeleteStream(&stream);
//system("pause");
return 0;
}

Just a guess, but it looks like you need to set the number of threads to use in OpenBLAS via the OPENBLAS_NUM_THREADS environment variable.

Try something like:

set OPENBLAS_NUM_THREADS=4

...on the command line before you run your program. Replace "4" with the number of cores in your system (some would say to set it to twice the number of cores; YMMV).

Make sure Streaming SIMD Extensions (SSE) are enabled when you compile your code. In Visual Studio, check the "Enable Enhanced Instruction Set" setting under the project's C/C++ > Code Generation options.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow