Question

Suppose I have C++ code with many small functions, each of which typically needs a matrix float M1(n,p), with n and p known at run time, to hold the results of intermediate computations (no need to initialize M1, just to declare it, because each function will simply overwrite all rows of M1).

Part of the reason for this is that each function works on an original data matrix that it can't modify, so many operations (sorting, de-meaning, sphering) need to be done "elsewhere".

Is it better practice to create a temporary M1(n,p) within each function, or to create it once and for all in main() and pass it to each function as a sort of bucket that each function can use as scratch space?

n and p are often moderately large: 10^2 to 10^4 for n, and 5 to 100 for p.

(originally posted at the codereview stackexchange but moved here).


Solution

  1. Heap allocations are indeed quite expensive.
  2. Premature optimization is bad, but if your library is quite general and the matrices are huge, it may not be premature to seek an efficient design. After all, you don't want to modify your design after it has accumulated many dependencies.
  3. There are various levels at which you can tackle this problem. For example, you can avoid the heap-allocation expense at the memory-allocator level (a per-thread memory pool, say).
  4. While heap allocation is expensive, you are creating one large matrix only to perform rather expensive operations on it (typically linear complexity or worse). Relatively speaking, allocating a matrix on the free store may not be that expensive compared to what you will inevitably do with it afterwards, so it can be quite cheap relative to the overall cost of a function like sorting.

I recommend you write the code naturally, keeping #3 in mind as a future possibility. That is, don't take in references to matrix buffers for intermediate computations just to accelerate the creation of temporaries. Make the temporaries and return them by value. Correctness and good, clear interfaces come first.
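As a minimal sketch of what "write it naturally" can look like, here is a hypothetical bare-bones matrix type (not the asker's actual class) whose single member vector does one heap allocation, with a de-meaning function that creates its scratch matrix locally and returns it by value. Thanks to copy elision and move semantics, the return itself is essentially free:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal matrix type for illustration only.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;            // one heap allocation, row-major storage
    Matrix(std::size_t n, std::size_t p) : rows(n), cols(p), data(n * p) {}
    float&       operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    const float& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// De-meaning written "naturally": the scratch matrix is a local temporary,
// and the result is returned by value (elided or moved, never deep-copied).
Matrix demean(const Matrix& input) {
    Matrix result(input.rows, input.cols);   // local temporary, no shared state
    for (std::size_t j = 0; j < input.cols; ++j) {
        float mean = 0.0f;
        for (std::size_t i = 0; i < input.rows; ++i) mean += input(i, j);
        mean /= static_cast<float>(input.rows);
        for (std::size_t i = 0; i < input.rows; ++i)
            result(i, j) = input(i, j) - mean;
    }
    return result;
}
```

Note that the allocation here costs O(1) calls into the allocator while the de-meaning itself is O(n·p) work, which is point #4 in practice.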

Mostly the goal here is to separate out the creational policy of a matrix (via an allocator or other means), which gives you breathing room to optimize later without changing much existing code. If you can do it by modifying only the implementation details of the functions involved, or better yet only the implementation of your matrix class, then you're in good shape: you're free to optimize without changing the design, and any design which allows that is generally going to be complete from an efficiency standpoint.


WARNING: The following is only intended if you really want to squeeze the most out of every cycle. It is essential to understand #4 and also to get yourself a good profiler. It's also worth noting that you'll probably gain more by optimizing the memory access patterns of these matrix algorithms than by trying to optimize away the heap allocation.


If you need to optimize the memory allocation, consider something general like a per-thread memory pool. You could make your matrix take an optional allocator, for instance, but I emphasize optional here, and I'd also emphasize correctness first, with a trivial allocator implementation.

In other words:

Is it better practice to declare M1(n,p) within each function, or rather once and for all in main() and pass it to each function as a sort of bucket that each function can use as scratch space?

Go ahead and create M1 as a temporary in each function. Try to avoid requiring the client to pass in a matrix that has no meaning to them, only to compute intermediate results. That would be exposing an optimization detail, which is something we should strive not to do when designing interfaces (hide all details that clients should not have to know about).

Instead, if you absolutely want the option to accelerate the creation of these temporaries, focus on a more general concept, like an optional allocator. This fits in with practical designs like std::set's:

std::set<int, std::less<int>, MyFastAllocator<int>> s; // <-- okay

Even though most people just do:

std::set<int> s;

In your case, it might simply be: M1 my_matrix(n, p, alloc);

It's a subtle difference, but an allocator is a much more general concept than a cached matrix which otherwise has no meaning to the client, except that it's some kind of cache your functions require to compute results faster. Note that it doesn't have to be a general allocator; it could just be your preallocated matrix buffer passed in to a matrix constructor. But conceptually it may be worth separating it out, simply because it is something more opaque to clients.
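One concrete way to sketch this optional-allocator idea with the standard library is C++17's std::pmr facilities. The PmrMatrix name below is hypothetical; the point is that the allocator parameter defaults to the global resource, so ordinary call sites stay trivial while hot paths can hand all their temporaries a per-thread arena:

```cpp
#include <cstddef>
#include <memory_resource>

// Hypothetical matrix with an optional, defaulted allocator parameter.
// Plain usage ignores it entirely; optimized usage passes an arena.
struct PmrMatrix {
    std::size_t rows, cols;
    std::pmr::vector<float> data;
    PmrMatrix(std::size_t n, std::size_t p,
              std::pmr::memory_resource* mr = std::pmr::get_default_resource())
        : rows(n), cols(p), data(n * p, 0.0f, mr) {}
    float& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// Ordinary call site, trivial allocator implementation:
//     PmrMatrix plain(100, 10);
//
// Hot path: many temporaries carved out of one preallocated arena,
// created per thread so there is no sharing to worry about:
//     std::pmr::monotonic_buffer_resource arena;
//     PmrMatrix fast(100, 10, &arena);
```

The design win is exactly the one described above: the creational policy lives in one constructor parameter, so you can switch from "correct and simple" to "pooled" without touching the functions that use the matrix.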

Additionally, care would be needed not to share such a scratch matrix across threads. That's another reason to generalize the concept a bit if you do go the optimization route: something like a matrix allocator can take thread safety into account, or at the very least emphasize by design that a separate allocator should be created per thread, whereas a raw matrix object cannot.


The above is only useful if you really care about the quality of your interfaces first and foremost. If not, I'd recommend going with Matthieu's advice as it is much simpler than creating an allocator, but both of us emphasize making the accelerated version optional.

OTHER TIPS

Do not optimise prematurely. Create something that works properly and well, and optimise later if it is shown to be slow.

(By the way I don't think stackoverflow is the correct place for it either).

In reality, if you want to speed up an application operating on large matrices, concurrency is the way to go. And if you are using concurrency, one big global matrix is likely to get you into far more trouble.

Essentially it means you can never have more than one calculation happening at a time, even if you have the memory for more.

The design of your matrix class itself also needs to be efficient, and we would have to see that design to judge.

I would therefore generally say: no, do NOT create one big global matrix, because it sounds wrong for what you want to do with it.

First try to define the matrix inside the function; that is definitely the better design choice. But if that causes performance losses you can't afford, I think passing a buffer by reference is OK, as long as you keep in mind that the functions are no longer thread safe. If at any point you use threads, each thread needs its own buffer.

There are advantages in terms of performance in requiring an externally supplied buffer, especially when you are required to chain the functions that make use of it.

However, from a user's point of view, it can soon get annoying.

I have often found that it's simple enough in C++ to get the best of both worlds by simply offering both ways:

// fast interface: the caller supplies the scratch buffer
int compute(Matrix const& argument, Matrix& buffer);

// safe interface: allocates its own buffer, then forwards
inline int compute(Matrix const& argument) {
  Matrix buffer(argument.width, argument.height);
  return compute(argument, buffer);
}

This very simple wrapping means that the code is written once, and two slightly different interfaces are presented.

The more involved API (taking a buffer) is also slightly less safe, since the buffer must satisfy some size constraints with respect to the argument. You might therefore want to further insulate the fast API (for example, behind a namespace) so as to encourage users to reach for the slower but safer interface first, and only try out the fast one when it's proven necessary.
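The namespace insulation suggested above can be sketched as follows. The Matrix type and the work inside compute are placeholders for illustration; the point is only the layout, where the safe overload is the default name and the buffer-taking overload is tucked away:

```cpp
#include <cstddef>
#include <vector>

// Placeholder matrix type for illustration.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t n, std::size_t p) : rows(n), cols(p), data(n * p) {}
};

namespace unchecked {
    // "Expert" API: the caller owns the buffer and is responsible
    // for sizing it correctly with respect to the argument.
    int compute(const Matrix& argument, Matrix& buffer) {
        (void)argument;  // real work elided; buffer size stands in for a result
        return static_cast<int>(buffer.data.size());
    }
}

// Safe default API: allocates a correctly sized buffer, then forwards,
// so the fast path and the safe path share one implementation.
inline int compute(const Matrix& argument) {
    Matrix buffer(argument.rows, argument.cols);   // guaranteed to fit
    return unchecked::compute(argument, buffer);
}
```

Callers reach for `compute(m)` by default; only code that has been profiled and needs to reuse a buffer across calls spells out `unchecked::compute(m, buf)`.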

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow