Question

In certain areas of development, such as game development, real-time systems, etc., it is important to have a fast and optimized program. On the other hand, modern compilers already do a lot of optimization, and optimizing in Assembly can be time-consuming in a world where deadlines are a factor to take into consideration.

Questions:

  1. Is optimizing certain functions with Assembly in a C/C++ program really worth it?

  2. Is there really a sufficient gain in performance when optimizing a C/C++ program with Assembly with today's modern compilers?


From the answers posted, what I understand is that any gain that can be made matters in certain areas, such as embedded systems and multimedia programming (graphics, sound, etc.). Also, one needs to be capable (or have someone capable) of doing a better job in Assembly than a modern compiler. Writing some really well-optimized C/C++ can take less time and do a good enough job. One last thing: learning Assembly can help you understand the inner mechanics of a program and make you a better programmer in the end.


Solution

I'd say it's not worth it. I work on software that does real-time 3D rendering (i.e., rendering without assistance from a GPU). I do make extensive use of SSE compiler intrinsics -- lots of ugly code filled with _mm_add_ps() and friends -- but I haven't needed to recode a function in assembly in a very long time.
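For readers who haven't seen the intrinsics style mentioned above: you write C++ that maps almost one-to-one onto SSE instructions, and the compiler handles register allocation and scheduling. A minimal illustrative sketch (the function name and shape are my own, not from the answer):

```cpp
#include <immintrin.h>  // SSE intrinsics (x86/x86-64)
#include <cstddef>

// Add two float arrays four lanes at a time with SSE.
// Assumes n is a multiple of 4; unaligned loads/stores keep it simple.
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);            // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);            // load 4 floats from b
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // out[i..i+3] = a + b
    }
}
```

You get most of the performance of hand-written SSE assembly while the compiler still owns instruction selection around the intrinsics.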

My experience is that good modern optimizing compilers are pretty darn effective at intricate, micro-level optimizations. They'll do sophisticated loop transformations such as reordering, unrolling, pipelining, blocking, tiling, jamming, fission, and the like. They'll schedule instructions to keep the pipeline filled, vectorize simple loops, and deploy some interesting bit twiddling hacks. Modern compilers are incredibly fascinating beasts.

Can you beat them? Well, sure; given that they choose which optimizations to apply by heuristics, they're bound to get it wrong sometimes. But I've found it's much better to optimize the code itself by looking at the bigger picture. Am I laying out my data structures in the most cache-friendly way? Am I doing something unorthodox that misleads the compiler? Can I rewrite something a bit to give the compiler better hints? Am I better off recomputing something instead of storing it? Could inserting a prefetch help? Have I got false cache sharing somewhere? Are there small code optimizations that the compiler considers unsafe but that are okay here (e.g., converting division to multiplication by the reciprocal)?
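The division-to-reciprocal rewrite mentioned above is a good concrete example of an optimization you can do by hand when you know the precision loss is acceptable, even though the compiler usually can't prove it's safe. A minimal sketch (my own illustration):

```cpp
#include <cstddef>

// Dividing every element by the same divisor: the compiler generally cannot
// replace x / d with x * (1.0f / d) on its own, because the two can differ
// in the last bit. If you know that's acceptable, do it yourself:
void scale(float* data, std::size_t n, float divisor)
{
    const float inv = 1.0f / divisor;   // one division instead of n
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= inv;                 // multiply is far cheaper than divide
}
```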

I like to work with the compiler instead of against it. Let it take care of the micro-level optimizations, so that you can focus on the mezzo-level optimizations. The important thing is to have a good idea how your compiler works so that you know where the boundaries between the two levels are.

OTHER TIPS

The only possible answer to that is: yes, if there is a performance gain that is relevant and useful.

The question, I guess, should really be: Can you get a meaningful performance gain by using assembly language in a C/C++ program?

The answer is yes.

The cases where you get a meaningful increase in performance have probably diminished over the last 10-20 years as libraries and compilers have improved, but on an architecture like x86 in particular, hand optimization can still pay off in certain applications (particularly graphics-related ones).

But, as with anything, don't optimize until you need to.

I would argue that algorithm optimization and writing highly efficient C (in particular) will create far more of a performance gain for less time spent than rewriting in assembly language in the vast majority of cases.

The difficulty is: can you do a better job of optimizing than the compiler can, given the architecture of modern CPUs? If you are designing for a simple CPU (such as for embedded systems), then you may be able to do reasonable optimizations; but for a pipelined architecture, optimization is much harder, as you need to understand how the pipelining works.

So, given that, if you can do this optimization, and you are working on something that the profiler tells you is too slow, and it is a part that should be as fast as possible, then yes optimizing makes sense.

Maybe

It completely depends on the individual program.

You need a profile, which you get with a profiling tool, before you know. Some programs spend all their time waiting for a database, or they just don't have concentrated runtime in a small area. Without that, assembly doesn't help much.

There is a rule of thumb that 90% of the runtime happens in 10% of the code. You really want one very intense bottleneck, and not every program has that.

Also, the machines are so fast now that some of the low-hanging fruit has been eaten, so to speak, by the compilers and CPU cores. For example, say you write way better code than the compiler and cut the instruction count in half. Even then if you end up doing the same number of memory references, and if they are the bottleneck, you may not win.

Of course, you could start preloading registers in previous loop iterations, but the compiler is likely to already be trying that.

Learning assembly is really more important as a way to comprehend what the machine really is, rather than as a way to beat the compiler. But give it a try!

There is one area where assembly optimisation is still regularly performed - embedded software. These processors are usually not very powerful, and have many architectural quirks that may not be exploited by the compiler for optimisation. That said, it should still only be done for particularly tight areas of code and it has to be very well documented.

I'll assume you've profiled your code, and you've found a small loop which is taking up most of the time.

First, try recompiling with more aggressive compiler optimizations, and then re-profile. If you're running with all compiler optimizations turned on and you still need more performance, then I recommend looking at the generated assembly.

What I typically do after looking at the assembly code for the function, is see how I can change the C code to get the compiler to write better assembly. The advantage of doing it this way, is I then end up with code which is tuned to run with my compiler on my processor, but is portable to other environments.
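One common rewrite of this kind (an illustrative example of my own, not from the answer) is eliminating potential pointer aliasing so the compiler can keep an accumulator in a register instead of reloading and storing through memory on every iteration:

```cpp
#include <cstddef>

// Accumulating through the output pointer: the compiler must assume data
// and sum may alias, so it can be forced to store *sum every iteration.
void sum_slow(const float* data, std::size_t n, float* sum)
{
    *sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        *sum += data[i];
}

// Accumulating into a local: an obvious register candidate, which often
// also unlocks unrolling. The observable result is identical.
void sum_fast(const float* data, std::size_t n, float* sum)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += data[i];
    *sum = acc;
}
```

The C stays portable; you've just stopped misleading the compiler, which is exactly the "better hints" approach described above.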

For your typical small shop developer writing an App, the performance gain/effort trade-off almost never justifies writing assembly. Even in situations where assembly can double the speed of some bottleneck, the effort is often not justifiable. In a larger company, it might be justifiable if you're the "performance guy".

However, for a library writer, even small improvements for large effort are often justified, because it saves time for thousands of developers and users who use the library in the end. Even more so for compiler writers. If you can get a 10% efficiency win in a core system library function, that can literally save millennia (or more) of battery life spread across your user base.

Definitely yes!

Here is a demonstration of a CRC-32 calculation which I wrote in C++ and then optimized in x86 assembler using Visual Studio.

InitCRC32Table() should be called at program start. CalcCRC32() will calculate the CRC for a given memory block. Both functions are implemented in both assembler and C++.

On a typical Pentium machine, you will notice that the assembler CalcCRC32() function is 50% faster than the C++ code.

The assembler implementation is not MMX or SSE, just plain x86 code. The compiler will never produce code that is as efficient as manually crafted assembler code.

    DWORD* panCRC32Table = NULL; // CRC-32 CCITT 0x04C11DB7

    void DoneCRCTables()
    {
        if (panCRC32Table )
        {
            delete[] panCRC32Table;
            panCRC32Table= NULL;
        }
    }

    void InitCRC32Table()
    {
        if (panCRC32Table) return;
        panCRC32Table= new DWORD[256];

        atexit(DoneCRCTables);

    /*
        for (int bx=0; bx<256; bx++)
        {
            DWORD eax= bx;
            for (int cx=8; cx>0; cx--)
                if (eax & 1)
                    eax= (eax>>1) ^ 0xEDB88320;
                else
                    eax= (eax>>1)             ;
            panCRC32Table[bx]= eax;
        }
    */
            _asm cld
            _asm mov    edi, panCRC32Table
            _asm xor    ebx, ebx
        p0: _asm mov    eax, ebx
            _asm mov    ecx, 8
        p1: _asm shr    eax, 1
            _asm jnc    p2
            _asm xor    eax, 0xEDB88320           // bit-swapped 0x04C11DB7
        p2: _asm loop   p1
            _asm stosd
            _asm inc    bl
            _asm jnz    p0
    }


/*
DWORD inline CalcCRC32(UINT nLen, const BYTE* cBuf, DWORD nInitVal= 0)
{
    DWORD crc= ~nInitVal;
    for (DWORD n=0; n<nLen; n++)
        crc= (crc>>8) ^ panCRC32Table[(crc & 0xFF) ^ cBuf[n]];
    return ~crc;
}
*/
DWORD inline __declspec (naked) __fastcall CalcCRC32(UINT        nLen       ,
                                                     const BYTE* cBuf       ,
                                                     DWORD       nInitVal= 0 ) // used to calc CRC of chained bufs
{
        _asm mov    eax, [esp+4]         // param3: nInitVal
        _asm jecxz  p2                   // __fastcall param1 ecx: nLen
        _asm not    eax
        _asm push   esi
        _asm push   ebp
        _asm mov    esi, edx             // __fastcall param2 edx: cBuf
        _asm xor    edx, edx
        _asm mov    ebp, panCRC32Table
        _asm cld

    p1: _asm mov    dl , al
        _asm shr    eax, 8
        _asm xor    dl , [esi]
        _asm xor    eax, [ebp+edx*4]
        _asm inc    esi
        _asm loop   p1

        _asm pop    ebp
        _asm pop    esi
        _asm not    eax
    p2: _asm ret    4                    // eax- returned value. 4 because there is 1 param in stack
}

// test code:

#include "mmSystem.h"                      // timeGetTime
#pragma comment(lib, "Winmm.lib" )

InitCRC32Table();

BYTE* x= new BYTE[1000000];
for (int i= 0; i<1000000; i++) x[i]= 0;

DWORD d1= ::timeGetTime();

for (i= 0; i<1000; i++)
    CalcCRC32(1000000, x, 0);

DWORD d2= ::timeGetTime();

TRACE("%d\n", d2-d1);

I would say that for most people and most applications, it's not worth it. Compilers are very good at optimising precisely for the architecture they're compiling for.

That's not to say that optimising in assembly is never warranted. A lot of math-heavy and low-level intensive code is often optimised by using specific CPU instructions, such as SSE* and the like, to beat the compiler's generated instruction and register use. In the end, the human knows precisely the point of the program; the compiler can only assume so much.

I would say that if you're not at the level where you know your own assembly will be faster, then I would let the compiler do the hard work.

Don't forget that by rewriting in assembly you lose portability. Today you don't care, but tomorrow your customers might want your software on another platform, and then those assembly snippets will really hurt.

Good answers. I would say "Yes" IF you have already done performance tuning like this, and you are now in the position of

  1. KNOWING (not guessing) that some particular hot-spot is taking more than 30% of your time,

  2. seeing just what assembly language the compiler generated for it, after all attempts to make it generate optimal code,

  3. knowing how to improve on that assembler code.

  4. being willing to give up some portability.

Compilers do not know everything you know, so they are defensive and cannot take advantage of what you know.

As one example, they write subroutine entry and exit code in a general way that works no matter what the subroutine contains. You, on the other hand, may be able to hand-code little routines that dispense with frame pointers, saving registers, and stuff like that. You're risking bugs, but it is possible to beat the compiler.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow