Question

I've come across this piece of code in some forum:

if ( a * b * c * d == 0 ) ....

and the owner claims this is a tad faster than

if (a == 0 || b == 0 || c == 0 || d == 0)

These variables are defined as:

int a, b, c, d;

And their absolute values are guaranteed to be less than or equal to 100. (So we could ignore the possibility of overflowing)

If we just ignore the readability and just focus on the performance, is the claim really correct?

It seems to me that the second approach might actually be faster since you could take advantage of 'short-circuit' sometimes. But then, what-do-I-know?!


The solution

The C standard says nothing about performance. The question of whether

if ( a * b * c * d == 0 )

is faster than

if (a == 0 || b == 0 || c == 0 || d == 0)

is meaningful only in the context of a particular compiler generating code running on a particular machine. The only real way to compare them is to measure the performance on your own system, or on whatever system you're interested in.

Still, we can speculate about what the performance is likely to be.

As you said, a, b, c, and d are objects of type int. You also said they're in the range [-100,+100] -- but the compiler doesn't necessarily know that.

A compiler is free to replace any expression with code that does the same thing.

Multiplication is a relatively complex operation, and is likely to be slower than, say, addition or comparison. A compiler could recognize that the first condition will be true if any of the four variables has the value 0, and replace the multiplications with whatever happens to be faster. But each optimization a compiler performs has to be explicitly programmed by the compiler's developers, and this particular pattern isn't likely to be common enough for it to be worth the effort of recognizing it.

You say the values are small enough that overflow isn't an issue. In fact, you can't portably make that assumption; INT_MAX can be as small as 32767. But the compiler knows how big an int is on the system for which it's generating code. Still, unless it has information about the values of a, b, c, and d, it can't assume that there will be no overflow.

Except that yes, actually, it can make that assumption. The behavior of signed integer overflow is undefined. That gives an optimizing compiler permission to assume that overflow can't occur (if it does, whatever behavior the program exhibits is valid anyway).

So yes, a compiler could replace the multiplications with something simpler, but it's not likely to do so.

As for the other expression, a == 0 || b == 0 || c == 0 || d == 0, the || operator has short-circuit semantics: if the left operand is true (non-zero), the right operand isn't evaluated. That kind of conditional code can hurt performance because of CPU pipeline stalls on mispredicted branches. Since none of the subexpressions have side effects (assuming none of the variables are declared volatile), the compiler is free to evaluate all four subexpressions, perhaps in parallel, if that's faster.

A quick experiment shows that gcc -O3 for x86 performs neither optimization. For the first expression, it generates code that performs three multiplications. For the second, it generates conditional branches, implementing the canonical short-circuit evaluation (I don't know whether avoiding that would be faster or not).

Your best bet is to write reasonable code that's as straightforward as possible, both because it makes your source code easier to read and maintain, and because it's likely to give the compiler a better chance to recognize patterns and perform optimizations. If you try to do fancy micro-optimizations in your source code, you're as likely to hinder the compiler's optimizations as you are to help.

Don't worry too much about how fast your code is unless you've measured it and found it to be too slow. If you need your code to be faster, first concentrate on improved algorithms and data structures. And only if that fails, consider source-level micro-optimizations.

The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.

-- Michael A. Jackson

Other tips

The two are not equivalent. For example, on my machine (32-bit x86 MSVC), if a, b, c and d are all equal to 0x100, the first test passes but the second does not.

Also note that multiplication is a costly operation, so the first version won't necessarily be faster.

EDIT: Code generated for the first version:

00401000 8B 44 24 04      mov         eax,dword ptr [esp+4] 
00401004 0F AF 44 24 08   imul        eax,dword ptr [esp+8] 
00401009 0F AF 44 24 0C   imul        eax,dword ptr [esp+0Ch] 
0040100E 0F AF 44 24 10   imul        eax,dword ptr [esp+10h] 
00401013 85 C0            test        eax,eax 
00401015 75 07            jne         f1+1Eh (40101Eh) 
00401017 ...

Code generated for the second version:

00401020 83 7C 24 04 00   cmp         dword ptr [esp+4],0 
00401025 74 15            je          f2+1Ch (40103Ch) 
00401027 83 7C 24 08 00   cmp         dword ptr [esp+8],0 
0040102C 74 0E            je          f2+1Ch (40103Ch) 
0040102E 83 7C 24 0C 00   cmp         dword ptr [esp+0Ch],0 
00401033 74 07            je          f2+1Ch (40103Ch) 
00401035 83 7C 24 10 00   cmp         dword ptr [esp+10h],0 
0040103A 75 07            jne         f2+23h (401043h) 
0040103C ...

Benchmarks on my machine (in nanoseconds): the first version runs in about 1.83 ns and the second in about 1.39 ns. The values of a, b, c and d didn't change during each run, so apparently the branch predictor could predict 100% of the branches.

So, as usual with "which is faster" questions: what have you tried so far? Did you compile the code and disassemble it to see what happens?

unsigned int mfun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
    if ( a * b * c * d == 0 ) return(7);
    else return(11);
}

unsigned int ofun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
    if (a == 0 || b == 0 || c == 0 || d == 0) return(7);
    else return(11);
}

For ARM, one compiler gives this:

00000000 <mfun>:
   0:   e0010190    mul r1, r0, r1
   4:   e0020291    mul r2, r1, r2
   8:   e0110293    muls    r1, r3, r2
   c:   13a0000b    movne   r0, #11
  10:   03a00007    moveq   r0, #7
  14:   e12fff1e    bx  lr

00000018 <ofun>:
  18:   e3500000    cmp r0, #0
  1c:   13510000    cmpne   r1, #0
  20:   0a000004    beq 38 <ofun+0x20>
  24:   e3520000    cmp r2, #0
  28:   13530000    cmpne   r3, #0
  2c:   13a0000b    movne   r0, #11
  30:   03a00007    moveq   r0, #7
  34:   e12fff1e    bx  lr
  38:   e3a00007    mov r0, #7
  3c:   e12fff1e    bx  lr

so the equals-and-ors version has short circuits (which are themselves costly), but its worst-case path takes longer, so its performance is erratic; the multiply version's performance is more deterministic. By inspection, the multiply solution should be faster for the above code.

MIPS gave me this:

00000000 <mfun>:
   0:   00a40018    mult    a1,a0
   4:   00002012    mflo    a0
    ...
  10:   00860018    mult    a0,a2
  14:   00002012    mflo    a0
    ...
  20:   00870018    mult    a0,a3
  24:   00002012    mflo    a0
  28:   10800003    beqz    a0,38 <mfun+0x38>
  2c:   00000000    nop
  30:   03e00008    jr  ra
  34:   2402000b    li  v0,11
  38:   03e00008    jr  ra
  3c:   24020007    li  v0,7

00000040 <ofun>:
  40:   10800009    beqz    a0,68 <ofun+0x28>
  44:   00000000    nop
  48:   10a00007    beqz    a1,68 <ofun+0x28>
  4c:   00000000    nop
  50:   10c00005    beqz    a2,68 <ofun+0x28>
  54:   00000000    nop
  58:   10e00003    beqz    a3,68 <ofun+0x28>
  5c:   00000000    nop
  60:   03e00008    jr  ra
  64:   2402000b    li  v0,11
  68:   03e00008    jr  ra
  6c:   24020007    li  v0,7

Unless the branches are too costly, the equals-and-ors version looks faster here.

OpenRISC 32:

00000000 <mfun>:
   0:   e0 64 1b 06     l.mul r3,r4,r3
   4:   e0 a3 2b 06     l.mul r5,r3,r5
   8:   e0 c5 33 06     l.mul r6,r5,r6
   c:   bc 26 00 00     l.sfnei r6,0x0
  10:   0c 00 00 04     l.bnf 20 <mfun+0x20>
  14:   9d 60 00 0b     l.addi r11,r0,0xb
  18:   44 00 48 00     l.jr r9
  1c:   15 00 00 00     l.nop 0x0
  20:   44 00 48 00     l.jr r9
  24:   9d 60 00 07     l.addi r11,r0,0x7

00000028 <ofun>:
  28:   e0 e0 20 02     l.sub r7,r0,r4
  2c:   e0 87 20 04     l.or r4,r7,r4
  30:   bd 64 00 00     l.sfgesi r4,0x0
  34:   10 00 00 10     l.bf 74 <ofun+0x4c>
  38:   e0 80 18 02     l.sub r4,r0,r3
  3c:   e0 64 18 04     l.or r3,r4,r3
  40:   bd 63 00 00     l.sfgesi r3,0x0
  44:   10 00 00 0c     l.bf 74 <ofun+0x4c>
  48:   e0 60 30 02     l.sub r3,r0,r6
  4c:   e0 c3 30 04     l.or r6,r3,r6
  50:   bd 66 00 00     l.sfgesi r6,0x0
  54:   10 00 00 08     l.bf 74 <ofun+0x4c>
  58:   e0 60 28 02     l.sub r3,r0,r5
  5c:   e0 a3 28 04     l.or r5,r3,r5
  60:   bd 85 00 00     l.sfltsi r5,0x0
  64:   0c 00 00 04     l.bnf 74 <ofun+0x4c>
  68:   9d 60 00 0b     l.addi r11,r0,0xb
  6c:   44 00 48 00     l.jr r9
  70:   15 00 00 00     l.nop 0x0
  74:   44 00 48 00     l.jr r9
  78:   9d 60 00 07     l.addi r11,r0,0x7

This depends on the implementation of multiply; if it completes in one clock, then the multiplies have it.

If your hardware doesn't support multiply, then you have to make a call to have it simulated:

00000000 <mfun>:
   0:   0b 12           push    r11     
   2:   0a 12           push    r10     
   4:   09 12           push    r9      
   6:   09 4d           mov r13,    r9  
   8:   0b 4c           mov r12,    r11 
   a:   0a 4e           mov r14,    r10 
   c:   0c 4f           mov r15,    r12 
   e:   b0 12 00 00     call    #0x0000 
  12:   0a 4e           mov r14,    r10 
  14:   0c 49           mov r9, r12 
  16:   b0 12 00 00     call    #0x0000 
  1a:   0a 4e           mov r14,    r10 
  1c:   0c 4b           mov r11,    r12 
  1e:   b0 12 00 00     call    #0x0000 
  22:   0e 93           tst r14     
  24:   06 24           jz  $+14        ;abs 0x32
  26:   3f 40 0b 00     mov #11,    r15 ;#0x000b
  2a:   39 41           pop r9      
  2c:   3a 41           pop r10     
  2e:   3b 41           pop r11     
  30:   30 41           ret         
  32:   3f 40 07 00     mov #7, r15 ;#0x0007
  36:   39 41           pop r9      
  38:   3a 41           pop r10     
  3a:   3b 41           pop r11     
  3c:   30 41           ret         

0000003e <ofun>:
  3e:   0f 93           tst r15     
  40:   09 24           jz  $+20        ;abs 0x54
  42:   0e 93           tst r14     
  44:   07 24           jz  $+16        ;abs 0x54
  46:   0d 93           tst r13     
  48:   05 24           jz  $+12        ;abs 0x54
  4a:   0c 93           tst r12     
  4c:   03 24           jz  $+8         ;abs 0x54
  4e:   3f 40 0b 00     mov #11,    r15 ;#0x000b
  52:   30 41           ret         
  54:   3f 40 07 00     mov #7, r15 ;#0x0007
  58:   30 41    

You would hope that the two are equivalent, and in a pure mathematical sense they should be: for the product to be zero, one operand has to be zero. The problem is that this is software running on a processor: a multiply can easily overflow and yield zero from non-zero operands, so to implement the code faithfully, the multiplies have to happen.

Because of the cost of multiply (and divide in particular), you should avoid them as much as possible in your software. In this case, making the multiply solution truly equivalent to the other would require even more code to detect or prevent the overflow cases that can lead to a false positive. Yes, many processors perform multiply in one clock, and divide as well; the reason you don't see divide, and sometimes don't see multiply, in an instruction set is the chip real estate required. The expense then becomes power, heat, the cost of the part, and so on. Multiply and divide remain expensive: among other things, they can become long poles in the tent for the part's performance and clock rate. Folks want single-clock operation without realizing that one such instruction may slow the whole chip down, whereas allowing it to be multi-clock might raise the overall clock rate. So many things are long poles in the tent that removing the multiply might not change performance at all. It all depends...

if ( a * b * c * d == 0 ) compiles to (without optimizations)

movl   16(%esp), %eax
imull  20(%esp), %eax
imull  24(%esp), %eax
imull  28(%esp), %eax
testl  %eax, %eax
jne .L3

and if (a == 0 || b == 0 || c == 0 || d == 0) compiles to

cmpl   $0, 16(%esp)
je  .L2
cmpl    $0, 20(%esp)
je  .L2
cmpl    $0, 24(%esp)
je .L2
cmpl    $0, 28(%esp)
jne .L4

Yes: when the test fails, the second version does at most 4 comparisons (operations), while the first version always does 4 operations.

Edit : Explanation

The second if instruction is always faster than the first one:

Suppose that a = 1, b = 2, c = 0 and d = 4. In this case:

  • For the first instruction: we have 3 multiplications and a comparison = 4 operations

  • For the second if instruction: we compare a to 0 (result KO), then b to 0 (again KO), then c to 0 (OK) = 3 operations.

Here is a simple program that outputs the execution time for these 2 instructions; you can modify a, b, c and d, and pass the number of the instruction as an argument.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* This is a test program to demonstrate that the second if is always faster than the first one */
int main(int argc, char **argv)
{
        int i;
        int a = 1;
        int b = 2;
        int c = 0;
        int d = 4;
        int instruction_number;
        clock_t begin, end;
        double time_spent;

        begin = clock();

        if (argc != 2)
        {
                fprintf(stderr, "Usage : ./a.out if_instruction_number (1 or 2)\n");

                exit(EXIT_FAILURE);
        }

        instruction_number = atoi(argv[1]);

        for (i = 1; i < 100000; i++)
        {
                switch (instruction_number)
                {
                        case 1:
                                fprintf(stdout, "First if instruction : \n");
                                if (a * b * c * d == 0)
                                        fprintf(stdout, "1st instruction\n");
                                break;
                        case 2:
                                fprintf(stdout, "Second if instruction : \n");
                                if (a == 0 || b == 0 || c == 0 || d == 0)
                                        fprintf(stdout, "2nd instruction\n");
                                break;
                        default:
                                break;
                }
        }
        end = clock();
        time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
        fprintf(stdout, "Time to accomplish %d instruction ---> %f\n", instruction_number, time_spent);

        return 0;
} 

Hope this helps.

Regards.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow