Question

My question is not about the dilemma of clean code vs. performance; rather, I want to understand the exact issue with declaring variables and sharing them between functions.

I have read in many threads that, from a performance point of view, it is good practice to declare basic variables like int, float, bool etc. as close to each other as possible. So it is better to declare a lot of variables in the scope of MyClass::method() than in the MyClass body. Is that true?

But what if I need a really large number of variables, and I need to use them in two separate functions? For example:

void MyClass::firstMethod()
{
    // here I use a lot of variables
    secondMethod(); // and here I need all those variables
}

So is it better (still thinking about performance) to declare all those variables in the body of MyClass? Or is it better to do something like this:

void MyClass::firstMethod()
{
    float var001, var002 ... var100;
    // here I make some calculations with all variables

    secondMethod(var001, var002 ... var100);
}

void MyClass::secondMethod(float &v001, float &v002 ... float &v100)
{
    // do some calculations
}

Of course it looks stupid when a method gets 100 input parameters. In Robert C. Martin's book "Clean Code" I read that methods should take no more than 3 or 4 input parameters. But I wonder whether that still performs well in such atypical algorithms.

I work on an audio processor, and in the audio process block I need to calculate a really large number of things: compute MID/SIDE samples, multiply by input gain and output gain, perform some filtering and dynamic analysis, multiply by a gain reduction coefficient, send some of those variables to the graphics thread to show them on audio monitors, and many more things. And I need to do that independently for each sample, and there are tens of thousands of samples each second. So is it good practice to declare as much as possible in the local scope, or is it better to declare it in the class body?

Please note that a lot of my variables also need to be std::atomic, so it's not only basic variables like I said at the beginning. I am not sure how to declare all of that to get the best performance.

Also, I have more than one processor block for different audio processes, which I can choose at run time from my application's GUI. So is it better to make one big processor block with a lot of if statements (or a switch) to choose the processor? Or is it better to declare some lambda to which I can assign the various processes and then only call that lambda in the main process block? The problem is that I also have various algorithms in each of my various processors, so I end up calling lambdas inside other lambdas.
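To make the two options concrete, I mean roughly something like this (the enum, names and the trivial math are just placeholders):

#include <functional>

enum class Mode { Compressor, Equalizer };

struct MyProcessor
{
    Mode mode = Mode::Compressor;                 // chosen from the GUI

    // Option 1: one big process block with a switch per sample
    float processWithSwitch(float s)
    {
        switch (mode)
        {
            case Mode::Compressor: return s * 0.5f;   // stand-in for the real algorithm
            case Mode::Equalizer:  return s * 2.0f;
        }
        return s;
    }

    // Option 2: assign a lambda once when the mode changes, then just call it per sample
    std::function<float(float)> process = [](float s) { return s * 0.5f; };
};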

I wonder how to handle all of these issues to get the best performance while also keeping the code as clean as possible.


Solution

This is an addition to the accepted answer written by @Christophe.


Compilers want clean code, too

Except it's not the Clean Code by Robert Cecil Martin.

The clean code that compilers want to read is called Static Single Assignment form (SSA). Modern compilers use SSA as their intermediate representation.
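As a rough illustration (not any particular compiler's actual output), SSA splits every variable into versions that are each assigned exactly once, which makes the optimizer's data-flow reasoning straightforward:

int f(int a, int b)
{
    int x = a + b;   // SSA: x1 = a + b
    x = x * 2;       // SSA: x2 = x1 * 2
    if (x > 100)
        x = 100;     // SSA: x3 = 100
    return x;        // SSA: x4 = phi(x2, x3); return x4
}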


Beliefs vs. optimizer vs. profilers

Optimizer refers to the compiler stages that automatically generate somewhat optimized code for you.

Profilers are like a toolbox of scientific instruments (measurement devices) - when you use the entire toolbox correctly, and interpret the results scientifically, it gives you reliable insights (facts and observations) which guide you in tweaking the code toward improved performance that fits within the same hardware constraints.

People up to intermediate skill levels should just let the optimizer do the job, and learn how to make the optimizer's job easier. Usually, this is also consistent with clean code; optimizers are typically built to recognize idiomatic clean code and then transform it into efficient generated (machine) code.

People who have 10-20 years of microarchitectural optimization experience will still occasionally resort to hand optimizing, because their needs might be beyond what optimizers can do. However, they must still love their profilers, or even build their own profilers, since this is their gold standard source of truth.

Several kinds of profilers (or profiler functionalities) are used:

  • A profiler that takes measurements at periodic timer interrupts. Usually, the measurement consists of: the instruction pointer (program counter, or the address of the instruction to be executed next), and a quick scan of the call stack. This generates a statistical estimate of "where is time spent in which pieces of code" and of the call graph.
  • A region timer. It records the total time taken in a region of code (a minimal sketch follows this list). There are different strategies for small regions and for large regions.
  • A profiler that can read from model-specific registers. CPUs implement hardware performance counters that reveal information about model-specific architectural operations, such as cache misses. This information is only available by reading from model-specific registers, and typically requires tools published by the CPU manufacturer.
  • Intrusive instrumentation profiler. These will insert a lot of code to help count the number of times a function is called. These will give an accurate number for the "call count", but their instrumentation overhead will make time-based measurement useless.
  • Memory profiler. Memory management (allocations and deallocations) and cache issues can affect performance.
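As a minimal sketch of the region-timer idea from the second bullet (the class and names are illustrative, not from any particular profiling library):

#include <chrono>
#include <cstdio>

struct RegionTimer
{
    const char* name;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

    explicit RegionTimer(const char* n) : name(n) {}

    ~RegionTimer()   // reports on scope exit
    {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld us\n", name, static_cast<long long>(us));
    }
};

void processBlock()
{
    RegionTimer t("processBlock");   // times this whole region
    // ... the work being measured ...
}

For very small regions, the timer's own overhead dominates, which is why a different strategy (e.g. timing many repetitions and dividing) is usually used there.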

These profilers are used in conjunction to give a full picture of software performance:

  • Use a combination of "time accurate" and "call count accurate" approaches to correctly attribute time spent in code.
  • Use a combination of "periodic sampling" and "MSR performance counter reading" to identify "hot spots", or short sequences of instructions that take up a lot of time, which may indicate weaknesses in the architecture design of the CPU for the given sequence of instructions, or that the algorithm needs to be rewritten to avoid this problematic sequence of instructions.
  • Use a combination of "end to end timing" and "memory profiling" to characterize the overall resource usage (total CPU cycles and peak RAM usage) of a whole benchmark scenario.

caches

Moving from an intermediate skill level to an expert level of software micro-optimization requires a rigorous understanding of how CPU caches, multi-core protocols and RAM work. This is too big a topic to cover here, but good reference material on CPU caches, cache coherence and memory systems is widely available.


caches, a recap of important points

Variables that are declared as instance members tend to live together on heap memory. If the this pointer is stored in the RCX register, the variables may be accessed with RCX+0x10, RCX+0x48, etc. If an instance method accesses some of the instance's variables, it is likely that other instance variables nearby will also be brought into the CPU cache.

CPU caches are organized into "cache lines". Older CPUs have 32-byte cache lines; modern CPUs tend to have 64-byte or 128-byte cache lines; this may yet increase for future CPU models.

Variables that are declared as local variables tend to live together on stack memory. These variables are accessed relative to the stack pointer RSP, such as RSP+0xC0.

The hot region of the stack memory is practically considered almost exclusive to a single core. Variables on the stack memory have local scope, which means their lifetime is enclosed within the function lifetime. Therefore, they only exist while the current function is executing, and/or when the current function calls some other functions. All these actions typically happen within the current CPU core.

Code that passes data from one core to another will typically allocate the data on the heap, so that the data's ownership (in the C++ sense) is shared between the two cores (e.g. with std::shared_ptr), meaning the data will remain alive for as long as either core still has work to do with it.
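A small sketch of the two placements described above (the member names are made up for illustration):

struct Processor
{
    // Instance members live together inside the object (often on the heap) and are
    // addressed relative to `this`; members used together tend to share cache lines.
    float inputGain  = 1.0f;
    float outputGain = 1.0f;
    float lastOutput = 0.0f;

    void process(float sample)
    {
        // Locals live on the current core's stack (addressed relative to the stack
        // pointer) and may never touch memory at all if they stay in registers.
        float mid  = sample * inputGain;
        float side = mid * 0.5f;
        lastOutput = (mid + side) * outputGain;
    }
};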


prefetching

CPUs are able to predictively prefetch adjacent CPU cache lines; sometimes they do this correctly. This means variables that are spatially adjacent to each other and semantically related to each other will automatically benefit from modern CPU prefetch logic. It is now almost universally accepted that you don't need to use a "prefetch hint", or extraneous instructions whose only purpose is to hint the prefetch. At best it is ignored, though you'd still need to pay for the instruction decode cost; at worst it interrupts what the automatic prefetch logic is already doing.


the consequence of multiple cores competing for access to the same region of memory

Yes, it can slow things down. Tremendously. (By a hundredfold.)

(Side note.) This is where "multi-core" vs. "multi-socket" makes a huge difference. If your software is used on multi-socket machines, the application needs to implement core affinity and socket affinity, sometimes combined into a "NUMA affinity" setting.

If multiple cores are reading from it, it may be worth making copies of the same data, so that each core reads its copy exclusively.
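A minimal sketch of that idea: pad each core's copy to its own cache line so the copies never contend. The 64-byte figure is a common but not universal cache-line size; C++17 names it std::hardware_destructive_interference_size, though not every standard library defines it.

#include <cstddef>

constexpr std::size_t kCacheLineSize = 64;   // assumption; see note above

// Each worker thread/core gets its own copy, aligned so that two copies
// never share a cache line (avoiding the contention described above).
struct alignas(kCacheLineSize) PerCoreState
{
    float gainReduction    = 1.0f;
    long  samplesProcessed = 0;
};

PerCoreState perCoreState[8];   // e.g. one entry per worker thread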

  • If multiple cores are collaborating on it (with multiple readers and writers), one may choose to consolidate all the work on a single core, or to implement a multicore queue-based work-distribution architecture similar to the LMAX Disruptor (as described by Martin Fowler).

Working around this problem may require one to experiment with different work size (data size) granularity. This is where software architecture (which is on-topic for this site) may enhance or impede software performance engineering.

To find out if this is happening, the first step is to take a coarse-to-fine approach to identify hot spots in the code; the second step is to apply common sense to decide whether a hot spot looks unreasonable, i.e. it is unbelievably slow given the small size and simplicity of the code.

The third step is to extract and amplify that code (for example, by putting it in a for-loop that repeats it 1,000 or 1,000,000 times), and to use an MSR performance counter profiling tool to observe the anomaly. This "amplify" technique is comparable to the polymerase chain reaction in genome detection; it ensures that what you're going to measure is 99% attributable to the code you've chosen to amplify.
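A sketch of that "amplify" step (suspectCode() is a placeholder for the extracted code under suspicion):

// Placeholder for the extracted code being investigated.
float suspectCode() { /* ... the isolated hot code ... */ return 1.0f; }

volatile float g_sink;   // keeps the optimizer from deleting the repeated work

void amplify()
{
    // Repeat the suspect code enough times that nearly all measured events
    // (cycles, cache misses, ...) are attributable to it, then run this under
    // the MSR/performance-counter profiler.
    for (long i = 0; i < 1'000'000; ++i)
        g_sink = suspectCode();
}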

Once the cause of the performance anomaly is understood, the code can be modified to work around the issue.


atomic, atomic, oh my

If you find yourself drowning in std::atomic, it is time to re-read the textbook chapters and articles on cache coherence.

Sometimes std::atomic might not be suitable for one's programming needs, in which case one may need to use compiler-specific atomic primitives. Often, people find out that a certain usage doesn't need any special atomic primitives at all, other than a compiler memory barrier (fence).
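One common pattern of a compiler-only barrier, as a sketch (whether it is sufficient depends entirely on the concrete situation): std::atomic_signal_fence constrains compiler reordering without emitting a CPU fence instruction, which is enough for data shared only with a signal handler (or interrupt) on the same thread.

#include <atomic>
#include <csignal>

static int payload = 0;
static volatile std::sig_atomic_t ready = 0;   // polled by a handler on the same thread

void publish(int value)
{
    payload = value;
    // Compiler-only barrier: prevents the compiler from reordering the two
    // stores, but emits no hardware fence instruction.
    std::atomic_signal_fence(std::memory_order_release);
    ready = 1;
}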


side note on low-level optimizations for algorithm performance

Don't forget SIMD and GPGPU.


OTHER TIPS

Beliefs vs. optimizer

There are a lot of beliefs out there that have been made obsolete by the progress of optimizers.

So you’d better write the code, then profile it, and only if it appears that the optimizer could not get rid of a bottleneck should you give it a second thought.

Instance vs local variables

Again, the driving force should be class design, not micro-optimization. That being said, the two concerns are not always mutually exclusive.

The main differences are:

  • local variables are easier to reason about, since they are used in only one function and you can more easily find out where they could be changed.
  • local variables can be optimized away if it appears that they are not (or are no longer) needed. For instance, a calculation result needed for parameter passing could stay in CPU registers and then be discarded, whereas instance variables will always need to be stored to memory at some point (see the sketch below). But we are speaking of a single store here: not an operation that should significantly influence your overall performance.

So older compilers may do a better job with local variables. But modern optimizers use global (interprocedural) optimization and are able to follow the flow of values across several functions, so here the difference matters much less.
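A small sketch of that point (the class and names are purely illustrative):

struct Accumulator
{
    float total = 0.0f;   // instance variable: must eventually be stored to memory

    void add(const float* samples, int n)
    {
        float sum = 0.0f;              // local: the optimizer can keep it in a register
        for (int i = 0; i < n; ++i)
            sum += samples[i];
        total += sum;                  // the single store mentioned above
    }
};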

Your design comes first

You mention the need for atomic. This means that you envisage several threads working on the same variables. For me, this makes these variables prime candidates for being instance variables.

You also describe complex calculations. These calculations should not be redone 10 times just because calling a function would be more readable: again, if a value belongs to your instance and not to a single function, make it an instance variable.

Throw-away variables can be local to the function. The others are best managed at the level of the instance. Deciding which variable is an instance variable and which is a local function variable is primarily a design question. Your design needs come first. Afterwards you can clean-code and micro-optimize, not the other way round!
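As a hedged sketch of how that split might look for the audio-processor case in the question (all names and the trivial math are invented for illustration):

#include <atomic>

struct ChannelProcessor
{
    // Shared with the GUI/monitoring thread: instance members, atomic.
    std::atomic<float> meterLevel{0.0f};
    std::atomic<float> gainReduction{1.0f};

    // Read by the audio thread only: plain instance members.
    float inputGain  = 1.0f;
    float outputGain = 1.0f;

    void processSample(float left, float right)
    {
        // Throw-away per-sample values: locals.
        float mid  = 0.5f * (left + right) * inputGain;
        float side = 0.5f * (left - right) * inputGain;
        float out  = (mid + side) * gainReduction.load(std::memory_order_relaxed) * outputGain;

        meterLevel.store(out, std::memory_order_relaxed);   // published to the GUI thread
    }
};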

Licensed under: CC-BY-SA with attribution