What's faster.. Multiple variables or single array? [duplicate]

https://softwareengineering.stackexchange.com/questions/385796

19-02-2021
|

Question

I have a C-coded function that realizes a very long calculation on a microcontroller. I try to optimize it for speed at the moment. The function content is created automatically using Mathematica. It has hundreds of calculations and looks like this:

void calcResult(float *result, float arg1, float arg2, ... float argN){
   float tmp1 = arg1 * 2 + arg2;
   float tmp2 = tmp1/arg3 + tmp1;
   ...
   ...
   float tmpN = tmp320 + tmp15 * 2;
   *result = tmp2 + tmpN;
}

I asked myself if it could possibly be faster not use "tmp" variables but an array with the same size as the number of "tmp" variables or maybe to speed things up in such a function in another way (under the assumption, that the calculations to get the "result" are already "optimized" in terms of the necessary calculation time)?

Edit:

In my opinion Is micro-optimisation important when coding? doesn't answer my specific question, even if the goal is to optimise the code as well.

Solution

The ultimate answer, as always, is to profile both approaches and let empirical results decide. However, I would not expect a difference in speed, given modern C compiler technology. You might even make the code slower because you're tying the compiler's hands in two ways:

You're dictating a specific order for variables to appear on the stack (assuming the array is stack-allocated)
You're dictating a specific storage location for the temporaries (on the stack, rather than in registers or spilling)

It's likely that your compiler can figure this out (see LLVM's Mem2Reg pass) but why not tell your compiler what you mean? The compiler is then surely free to make software pipelining and register allocation (and stack locality) decisions for you.

I would also recommend declaring all your temporaries const because even if your compiler can infer that on its own, it's helpful for debugging your code generator.

OTHER TIPS

When these conditions are satisfied:

The array is local (confined to the function scope)
The array is only used locally
Nothing about the array's structure (e.g. pointers, references, type) are being leaked into another function (via a function call or a store),
Etc ... (I might be wrong since I'm not experienced in compiler optimizations)

The compiler may apply scalar replacement of aggregates (SRA, SRoA), which means the local array (which is used in elementwise single assignment way) approach and the local variables approach might be optimized in the same way, without requiring a stack space for hundreds of float elements or variables.

To check whether this is the case, it is necessary to take a look at the disassembly from an optimized build. There might be factors that prevent the compiler from applying SRA.

If SRA isn't performed, there is a huge problem with the array approach. It means the function will reserve a stack space enough for hundreds of float variable - the size of the array, since it determined that it couldn't "eliminate" the array. Compare this to the local scalar variables approach, whee compilers can rotate them onto registers and back to memory, and to reuse memory for already expired or disused values.

Once again, I highly recommend taking a look at the disassembly, from both optimized and unoptimized builds. You'll learn something. I did not learn compiler theory or optimizations in school. Most of what I learned about these topics come from years of reading compiler-generated disassembly, and reading expertly-written articles online.

The function might also be inlined. In which case, its optimization outcome also depends on how the caller is written.

Like others already pointed out I would not get my hopes up too high regarding the effect of your suggested changes. Your best bet is to spot calculations tracks that do not necessarily have to be executed sequentially and have those done on separate threads. This will likely get you a noticable performance gain.

However, looking at your example, it appears every step needs the result of the preceding step so in this particular case that does not seem a viable option.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange