Can modern compilers unroll `for` loops expressed using begin and end iterators

https://stackoverflow.com/questions/11529006

21-06-2021
|

Question

Consider the following code

 vector<double> v;
 // fill v
 const vector<double>::iterator end =v.end();
 for(vector<double>::iterator i = v.bgin(); i != end; ++i) {
   // do stuff
 }

Are compilers like g++, clang++, icc able to unroll loops like this. Unfortunately I do not know assembly to be able verify from the output whether the loop gets unrolled or not. (and I only have access to g++.)

To me it seems that this will require more smartness than usual on behalf of the compiler, first to deduce that the iterator is a random access iterator, and then figure out the number of times the loop is executed. Can compilers do this when optimization is enabled ?

Thanks for your replies, and before some of you start lecturing about premature optimization, this is an excercise in curiosity.

Solution

To me it seems that this will require more smartness than usual on behalf of the compiler, first to deduce that the iterator is a random access iterator, and then figure out the number of times the loop is executed.

The STL, being comprised entirely of templates, has all the code inline. So, random access iterators reduce to pointers already when the compiler begins to apply optimizations. One of the reasons the STL was created was so that there would be less need for a programmer to outwit the compiler. You should rely on the STL to do the right thing until proven otherwise.

Of course, it is still up to you to choose the proper tool from the STL to use...

Edit: There was discussion about whether g++ does any loop unrolling. On the versions that I am using, loop unrolling is not part of -O, -O2, or -O3, and I get identical assembly for the latter two levels with the following code:

void foo (std::vector<int> &v) {
    volatile int c = 0;
    const std::vector<int>::const_iterator end = v.end();
    for (std::vector<int>::iterator i = v.begin(); i != end; ++i) {
        *i = c++;
    }
}

With the corresponding assembly -O2 assembly:

_Z3fooRSt6vectorIiSaIiEE:
.LFB435:
        movq    8(%rdi), %rcx
        movq    (%rdi), %rax
        movl    $0, -4(%rsp)
        cmpq    %rax, %rcx
        je      .L4
        .p2align 4,,10
        .p2align 3
.L3:
        movl    -4(%rsp), %edx
        movl    %edx, (%rax)
        addq    $4, %rax
        addl    $1, %edx
        cmpq    %rax, %rcx
        movl    %edx, -4(%rsp)
        jne     .L3
.L4:
        rep
        ret

With the -funroll-loops option added, the function expands into something much much larger. But, the documentation warns about this option:

Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.

As a further argument to dissuade you from unrolling loops yourself, I'll finish this answer with an illustration of applying Duff's Device to the foo function above:

void foo_duff (std::vector<int> &v) {
    volatile int c = 0;
    const std::vector<int>::const_iterator end = v.end();
    std::vector<int>::iterator i = v.begin();
    switch ((end - i) % 4) do {
    case 0: *i++ = c++;
    case 3: *i++ = c++;
    case 2: *i++ = c++;
    case 1: *i++ = c++;
    } while (i != end);
}

GCC has another loop optimization flag:

-ftree-loop-optimize
Perform loop optimizations on trees. This flag is enabled by default at -O and higher.

So, the -O option enables simple loop optimizations for the innermost loops, including complete loop unrolling (peeling) for loops with a fixed number of iterations. (Thanks to doc for pointing this out to me.)

OTHER TIPS

I would propose that whether or not the compiler CAN unroll the loop, with modern pipelined architectures and caches, unless your "do stuff" is trivial, there is little benefit in doing so, and in many cases doing so would be a performance HIT instead of a boon. If your "do stuff" is nontrivial, unrolling the loop will create multiple copies of this nontrivial code, which will take extra time to load into the cache, significantly slowing down the first iteration through the unrolled loop. At the same time, it will evict more code from the cache, which may have been necessary for performing the "do stuff" if it makes any function calls, which would then need to be reloaded into the cache again. The purpose for unrolling loops made a lot of sense before cacheless pipelined non-branch-predictive architectures, with the goal being to reduce the overhead associated with the loop logic. Nowadays with cache-based pipelined branch-predictive hardware, your cpu will be pipelined well into the next loop iteration, speculatively executing the loop code again, by the time you detect the i==end exit condition, at which point the processor will throw out that final speculatively-executed set of results. In such an architecture, loop unrolling makes very little sense. It would further bloat code for virtually no benefit.

The short answer is yes. It will unroll as much as it can. In your case, it depends how you define end obviously (I assume your example is generic). Not only will most modern compilers unroll, but they will also vectorize and do other optimizations that will often blow your own solutions out of the water.

So what I'm saying is don't prematurely optimize! Just kidding :)

Simple answer: generally NO! At least when it comes to complete loop unrolling.

Let's test loop unrolling on this simple, dirty-coded (for testing purposes) structure.

struct Test
{
    Test(): begin(arr), end(arr + 4) {}

    double * begin;
    double * end;
    double arr[4];
};

First let's take counted loop and compile it without any optimizations.

double counted(double param, Test & d)
{
    for (int i = 0; i < 4; i++)
        param += d.arr[i];
    return param;
}

Here's what gcc 4.9 produces.

counted(double, Test&):
    pushq   %rbp
    movq    %rsp, %rbp
    movsd   %xmm0, -24(%rbp)
    movq    %rdi, -32(%rbp)
    movl    $0, -4(%rbp)
    jmp .L2
.L3:
    movq    -32(%rbp), %rax
    movl    -4(%rbp), %edx
    movslq  %edx, %rdx
    addq    $2, %rdx
    movsd   (%rax,%rdx,8), %xmm0
    movsd   -24(%rbp), %xmm1
    addsd   %xmm0, %xmm1
    movq    %xmm1, %rax
    movq    %rax, -24(%rbp)
    addl    $1, -4(%rbp)
.L2:
    cmpl    $3, -4(%rbp)
    jle .L3
    movq    -24(%rbp), %rax
    movq    %rax, -40(%rbp)
    movsd   -40(%rbp), %xmm0
    popq    %rbp
    ret

As expected loop hasn't been unrolled and, since no optimizations were performed, code is generally very verbose. Now let's turn on -O3 flag. Produced disassembly:

counted(double, Test&):
    addsd   16(%rdi), %xmm0
    addsd   24(%rdi), %xmm0
    addsd   32(%rdi), %xmm0
    addsd   40(%rdi), %xmm0
    ret

Voila, loop has been unrolled this time.

Now let's take a look at iterated loop. Function containing the loop will look like this.

double iterated(double param, Test & d)
{
  for (double * it = d.begin; it != d.end; ++it)
    param += *it;
  return param;
}

Still using -O3 flag, let's take a look at disassembly.

iterated(double, Test&):
    movq    (%rdi), %rax
    movq    8(%rdi), %rdx
    cmpq    %rdx, %rax
    je  .L3
.L4:
    addsd   (%rax), %xmm0
    addq    $8, %rax
    cmpq    %rdx, %rax
    jne .L4
.L3:
    rep ret

Code looks better than in the very first case, because optimizations were performed, but loop hasn't been unrolled this time!

What about funroll-loops and funroll-all-loops flags? They will produce result similar to this

iterated(double, Test&):
    movq    (%rdi), %rsi
    movq    8(%rdi), %rcx
    cmpq    %rcx, %rsi
    je  .L3
    movq    %rcx, %rdx
    leaq    8(%rsi), %rax
    addsd   (%rsi), %xmm0
    subq    %rsi, %rdx
    subq    $8, %rdx
    shrq    $3, %rdx
    andl    $7, %edx
    cmpq    %rcx, %rax
    je  .L43
    testq   %rdx, %rdx
    je  .L4
    cmpq    $1, %rdx
    je  .L29
    cmpq    $2, %rdx
    je  .L30
    cmpq    $3, %rdx
    je  .L31
    cmpq    $4, %rdx
    je  .L32
    cmpq    $5, %rdx
    je  .L33
    cmpq    $6, %rdx
    je  .L34
    addsd   (%rax), %xmm0
    leaq    16(%rsi), %rax
.L34:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L33:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L32:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L31:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L30:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L29:
    addsd   (%rax), %xmm0
    addq    $8, %rax
    cmpq    %rcx, %rax
    je  .L44
.L4:
    addsd   (%rax), %xmm0
    addq    $64, %rax
    addsd   -56(%rax), %xmm0
    addsd   -48(%rax), %xmm0
    addsd   -40(%rax), %xmm0
    addsd   -32(%rax), %xmm0
    addsd   -24(%rax), %xmm0
    addsd   -16(%rax), %xmm0
    addsd   -8(%rax), %xmm0
    cmpq    %rcx, %rax
    jne .L4
.L3:
    rep ret
.L44:
    rep ret
.L43:
    rep ret

Compare results with unrolled loop for counted loop. It's clearly not the same. What we see here is that gcc divided the loop into 8 element chunks. This can increase performance in some cases, because loop exit condition is checked once per 8 normal loop iterations. With additional flags vectorization could be also performed. But it isn't complete loop unrolling.

Iterated loop will be unrolled however if Test object is not a function argument.

double iteratedLocal(double param)
{
  Test d;
  for (double * it = d.begin; it != d.end; ++it)
    param += *it;
  return param;
}

Disassembly produced with only -O3 flag:

iteratedLocal(double):
    addsd   -40(%rsp), %xmm0
    addsd   -32(%rsp), %xmm0
    addsd   -24(%rsp), %xmm0
    addsd   -16(%rsp), %xmm0
    ret

As you can see loop has been unrolled. This is because compiler can now safely assume that end has fixed value, while it couldn't predict that for function argument.

Test structure is statically allocated however. Things are more complicated with dynamically allocated structures like std::vector. From my observations on modified Test structure, so that it ressembles dynamically allocated container, it looks like gcc tries its best to unroll loops, but in most cases generated code is not as simple as one above.

As you ask for other compilers, here's output from clang 3.4.1 (-O3 flag)

counted(double, Test&):                      # @counted(double, Test&)
    addsd   16(%rdi), %xmm0
    addsd   24(%rdi), %xmm0
    addsd   32(%rdi), %xmm0
    addsd   40(%rdi), %xmm0
    ret

iterated(double, Test&):                     # @iterated(double, Test&)
    movq    (%rdi), %rax
    movq    8(%rdi), %rcx
    cmpq    %rcx, %rax
    je  .LBB1_2
.LBB1_1:                                # %.lr.ph
    addsd   (%rax), %xmm0
    addq    $8, %rax
    cmpq    %rax, %rcx
    jne .LBB1_1
.LBB1_2:                                # %._crit_edge
    ret

iteratedLocal(double):                     # @iteratedLocal(double)
    leaq    -32(%rsp), %rax
    movq    %rax, -48(%rsp)
    leaq    (%rsp), %rax
    movq    %rax, -40(%rsp)
    xorl    %eax, %eax
    jmp .LBB2_1
.LBB2_2:                                # %._crit_edge4
    movsd   -24(%rsp,%rax), %xmm1
    addq    $8, %rax
.LBB2_1:                                # =>This Inner Loop Header: Depth=1
    movaps  %xmm0, %xmm2
    cmpq    $24, %rax
    movaps  %xmm1, %xmm0
    addsd   %xmm2, %xmm0
    jne .LBB2_2
    ret

Intel's icc 13.01 (-O3 flag)

counted(double, Test&):
        addsd     16(%rdi), %xmm0                               #24.5
        addsd     24(%rdi), %xmm0                               #24.5
        addsd     32(%rdi), %xmm0                               #24.5
        addsd     40(%rdi), %xmm0                               #24.5
        ret                                                     #25.10
iterated(double, Test&):
        movq      (%rdi), %rdx                                  #30.26
        movq      8(%rdi), %rcx                                 #30.41
        cmpq      %rcx, %rdx                                    #30.41
        je        ..B3.25       # Prob 50%                      #30.41
        subq      %rdx, %rcx                                    #30.7
        movb      $0, %r8b                                      #30.7
        lea       7(%rcx), %rax                                 #30.7
        sarq      $2, %rax                                      #30.7
        shrq      $61, %rax                                     #30.7
        lea       7(%rax,%rcx), %rcx                            #30.7
        sarq      $3, %rcx                                      #30.7
        cmpq      $16, %rcx                                     #30.7
        jl        ..B3.26       # Prob 10%                      #30.7
        movq      %rdx, %rdi                                    #30.7
        andq      $15, %rdi                                     #30.7
        je        ..B3.6        # Prob 50%                      #30.7
        testq     $7, %rdi                                      #30.7
        jne       ..B3.26       # Prob 10%                      #30.7
        movl      $1, %edi                                      #30.7
..B3.6:                         # Preds ..B3.5 ..B3.3
        lea       16(%rdi), %rax                                #30.7
        cmpq      %rax, %rcx                                    #30.7
        jl        ..B3.26       # Prob 10%                      #30.7
        movq      %rcx, %rax                                    #30.7
        xorl      %esi, %esi                                    #30.7
        subq      %rdi, %rax                                    #30.7
        andq      $15, %rax                                     #30.7
        negq      %rax                                          #30.7
        addq      %rcx, %rax                                    #30.7
        testq     %rdi, %rdi                                    #30.7
        jbe       ..B3.11       # Prob 2%                       #30.7
..B3.9:                         # Preds ..B3.7 ..B3.9
        addsd     (%rdx,%rsi,8), %xmm0                          #31.9
        incq      %rsi                                          #30.7
        cmpq      %rdi, %rsi                                    #30.7
        jb        ..B3.9        # Prob 82%                      #30.7
..B3.11:                        # Preds ..B3.9 ..B3.7
        pxor      %xmm6, %xmm6                                  #28.12
        movaps    %xmm6, %xmm7                                  #28.12
        movaps    %xmm6, %xmm5                                  #28.12
        movsd     %xmm0, %xmm7                                  #28.12
        movaps    %xmm6, %xmm4                                  #28.12
        movaps    %xmm6, %xmm3                                  #28.12
        movaps    %xmm6, %xmm2                                  #28.12
        movaps    %xmm6, %xmm1                                  #28.12
        movaps    %xmm6, %xmm0                                  #28.12
..B3.12:                        # Preds ..B3.12 ..B3.11
        addpd     (%rdx,%rdi,8), %xmm7                          #31.9
        addpd     16(%rdx,%rdi,8), %xmm6                        #31.9
        addpd     32(%rdx,%rdi,8), %xmm5                        #31.9
        addpd     48(%rdx,%rdi,8), %xmm4                        #31.9
        addpd     64(%rdx,%rdi,8), %xmm3                        #31.9
        addpd     80(%rdx,%rdi,8), %xmm2                        #31.9
        addpd     96(%rdx,%rdi,8), %xmm1                        #31.9
        addpd     112(%rdx,%rdi,8), %xmm0                       #31.9
        addq      $16, %rdi                                     #30.7
        cmpq      %rax, %rdi                                    #30.7
        jb        ..B3.12       # Prob 82%                      #30.7
        addpd     %xmm6, %xmm7                                  #28.12
        addpd     %xmm4, %xmm5                                  #28.12
        addpd     %xmm2, %xmm3                                  #28.12
        addpd     %xmm0, %xmm1                                  #28.12
        addpd     %xmm5, %xmm7                                  #28.12
        addpd     %xmm1, %xmm3                                  #28.12
        addpd     %xmm3, %xmm7                                  #28.12
        movaps    %xmm7, %xmm0                                  #28.12
        unpckhpd  %xmm7, %xmm0                                  #28.12
        addsd     %xmm0, %xmm7                                  #28.12
        movaps    %xmm7, %xmm0                                  #28.12
..B3.14:                        # Preds ..B3.13 ..B3.26
        lea       1(%rax), %rsi                                 #30.7
        cmpq      %rsi, %rcx                                    #30.7
        jb        ..B3.25       # Prob 50%                      #30.7
        subq      %rax, %rcx                                    #30.7
        cmpb      $1, %r8b                                      #30.7
        jne       ..B3.17       # Prob 50%                      #30.7
..B3.16:                        # Preds ..B3.17 ..B3.15
        xorl      %r8d, %r8d                                    #30.7
        jmp       ..B3.21       # Prob 100%                     #30.7
..B3.17:                        # Preds ..B3.15
        cmpq      $2, %rcx                                      #30.7
        jl        ..B3.16       # Prob 10%                      #30.7
        movq      %rcx, %r8                                     #30.7
        xorl      %edi, %edi                                    #30.7
        pxor      %xmm1, %xmm1                                  #28.12
        lea       (%rdx,%rax,8), %rsi                           #31.19
        andq      $-2, %r8                                      #30.7
        movsd     %xmm0, %xmm1                                  #28.12
..B3.19:                        # Preds ..B3.19 ..B3.18
        addpd     (%rsi,%rdi,8), %xmm1                          #31.9
        addq      $2, %rdi                                      #30.7
        cmpq      %r8, %rdi                                     #30.7
        jb        ..B3.19       # Prob 82%                      #30.7
        movaps    %xmm1, %xmm0                                  #28.12
        unpckhpd  %xmm1, %xmm0                                  #28.12
        addsd     %xmm0, %xmm1                                  #28.12
        movaps    %xmm1, %xmm0                                  #28.12
..B3.21:                        # Preds ..B3.20 ..B3.16
        cmpq      %rcx, %r8                                     #30.7
        jae       ..B3.25       # Prob 2%                       #30.7
        lea       (%rdx,%rax,8), %rax                           #31.19
..B3.23:                        # Preds ..B3.23 ..B3.22
        addsd     (%rax,%r8,8), %xmm0                           #31.9
        incq      %r8                                           #30.7
        cmpq      %rcx, %r8                                     #30.7
        jb        ..B3.23       # Prob 82%                      #30.7
..B3.25:                        # Preds ..B3.23 ..B3.21 ..B3.14 ..B3.1
        ret                                                     #32.14
..B3.26:                        # Preds ..B3.2 ..B3.6 ..B3.4    # Infreq
        movb      $1, %r8b                                      #30.7
        xorl      %eax, %eax                                    #30.7
        jmp       ..B3.14       # Prob 100%                     #30.7
iteratedLocal(double):
        lea       -8(%rsp), %rax                                #8.13
        lea       -40(%rsp), %rdx                               #7.11
        cmpq      %rax, %rdx                                    #33.41
        je        ..B4.15       # Prob 50%                      #33.41
        movq      %rax, -48(%rsp)                               #32.12
        movq      %rdx, -56(%rsp)                               #32.12
        xorl      %eax, %eax                                    #33.7
..B4.13:                        # Preds ..B4.11 ..B4.13
        addsd     -40(%rsp,%rax,8), %xmm0                       #34.9
        incq      %rax                                          #33.7
        cmpq      $4, %rax                                      #33.7
        jb        ..B4.13       # Prob 82%                      #33.7
..B4.15:                        # Preds ..B4.13 ..B4.1
        ret                                                     #35.14

To avoid misunderstandings. If counted loop condition would rely on external parameter like this one.

double countedDep(double param, Test & d)
{
    for (int i = 0; i < d.size; i++)
        param += d.arr[i];
    return param;
}

Such loop also will not be unrolled.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow