Question

I need a way to represent a 2-D array (a dense matrix) of doubles in C++, with absolute minimum accessing overhead.

I've done some timing on various linux/unix machines and gcc versions. An STL vector of vectors, declared as:

vector<vector<double> > matrix(n,vector<double>(n));

and accessed through matrix[i][j] is between 5% and 100% slower to access than an array declared as:

double *matrix = new double[n*n];

accessed through an inlined index function matrix[index(i,j)], where index(i,j) evaluates to i+n*j. Other ways of arranging a 2-D array without STL - an array of n pointers to the start of each row, or defining the whole thing on the stack as a constant size matrix[n][n] - run at almost exactly the same speed as the index function method.

Recent GCC versions (> 4.0) seem to be able to compile the STL vector-of-vectors to nearly the same efficiency as the non-STL code when optimisations are turned on, but this is somewhat machine-dependent.

I'd like to use STL if possible, but will have to choose the fastest solution. Does anyone have any experience in optimising STL with GCC?

Was it helpful?

Solution

If you're using GCC the compiler can analyze your matrix accesses and change the order in memory in certain cases. The magic compiler flag is defined as:

-fipa-matrix-reorg

Perform matrix flattening and transposing. Matrix flattening tries to replace an m-dimensional matrix with its equivalent n-dimensional matrix, where n < m. This reduces the level of indirection needed for accessing the elements of the matrix. The second optimization is matrix transposing, which attempts to change the order of the matrix's dimensions in order to improve cache locality. Both optimizations need the -fwhole-program flag. Transposing is enabled only if profiling information is available.

Note that this option is not enabled by -O2 or -O3. You have to pass it yourself.

OTHER TIPS

My guess would be that the fastest approach, for a matrix, is to use a 1-D STL array and overload the () operator to use it as a 2-D matrix.

However, the STL also defines a type specifically for non-resizeable numerical arrays: valarray. It also comes with various optimisations for in-place operations.

valarray accepts a numerical type as its template argument:

valarray<double> a;

Then, you can use slices, indirect arrays, and so on, and of course you can inherit from valarray and define your own operator()(int i, int j) for 2-D arrays.

Very likely this is a locality-of-reference issue. vector uses new to allocate its internal array, so each row will be at least a little apart in memory due to each block's header; it could be a long distance apart if memory is already fragmented when you allocate them. Different rows of the array are likely to at least incur a cache-line fault and could incur a page fault; if you're really unlucky two adjacent rows could be on memory lines that share a TLB slot and accessing one will evict the other.

In contrast, your other solutions guarantee that all the data is adjacent. It could also help performance to align the structure so that it crosses as few cache lines as possible.

vector is designed for resizable arrays. If you don't need to resize the arrays, use a regular C++ array. STL operations can generally operate on C++ arrays.

Do be sure to walk the array in the correct direction, i.e. across (consecutive memory addresses) rather than down. This will reduce cache faults.
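Concretely, for a row-major layout the inner loop should run over the column index, so successive iterations touch consecutive addresses. A sketch (the function name `sum_row_major` is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Cache-friendly traversal of a row-major n*n matrix: the inner loop
// walks consecutive memory addresses. Swapping the loops would stride
// by n doubles per access and incur far more cache misses for large n.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)        // row index
        for (std::size_t j = 0; j < n; ++j)    // column index varies fastest
            sum += m[i * n + j];
    return sum;
}
```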

My recommendation would be to use Boost.UBLAS, which provides fast matrix/vector classes.

To be fair, it depends on the algorithms you are using on the matrix.

The double name[n*m] format is very fast when you are accessing data by rows, both because it has almost no overhead besides a multiplication and an addition, and because your rows are packed data that will stay coherent in cache.

If your algorithm accesses column-ordered data, other layouts may have much better cache coherence. If your algorithm accesses data in quadrants of the matrix, still other layouts might be better.

Do some research directed at the kind of usage and algorithms you employ. This is especially important if the matrices are very large, since cache misses may hurt your performance far more than the 1 or 2 extra arithmetic operations needed to access each address.
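To illustrate the point above: the same logical matrix can be stored row-major or column-major just by changing the index function, and which one is faster depends on whether the algorithm walks rows or columns. Both function names here are illustrative:

```cpp
#include <cstddef>

// Two index functions for the same logical matrix, differing only in layout.
inline std::size_t row_major(std::size_t i, std::size_t j, std::size_t cols) {
    return i * cols + j;   // rows contiguous: fast row-wise traversal
}
inline std::size_t col_major(std::size_t i, std::size_t j, std::size_t rows) {
    return j * rows + i;   // columns contiguous: fast column-wise traversal
}
```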

You could just as easily do vector<double>(n*m);

You may want to look at the Eigen C++ template library at http://eigen.tuxfamily.org/ . It generates AltiVec or sse2 code to optimize the vector/matrix calculations.

There is the uBLAS implementation in Boost. It is worth a look.

http://www.boost.org/doc/libs/1_36_0/libs/numeric/ublas/doc/matrix.htm

Another related library is Blitz++: http://www.oonumerics.org/blitz/docs/blitz.html

Blitz++ is designed to optimize array manipulation.

I have done this some time back for raw images by declaring my own 2 dimensional array classes.

In a normal 2D array, you access the elements like array[2][3]. To get that effect, you'd have an array class with an overloaded [] accessor. But this would essentially have to return another array, thereby giving you the second dimension.

The problem with this approach is that it incurs a double function-call overhead on every access.

The way I did it was to use the () style overload.

So instead of array[2][3], I had it do array(2,3).

That () function was very tiny and I made sure it was inlined.

See this link for the general concept of that: http://www.learncpp.com/cpp-tutorial/99-overloading-the-parenthesis-operator/

You can template the type if you need to.
The difference was that my array was dynamic: I allocated a block of char memory, and I kept a cache of row offsets so I knew where in my sequence of bytes each row began. Access was optimized for neighbouring values, because I was using it for image processing.

It's hard to explain without the code but essentially the result was as fast as C, and much easier to understand and use.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow