Question

I write a lot of vectorized loops, so one common idiom is

volatile int dummy[1<<10];
for (int64_t i = 0; i + 16 <= argc; i+= 16)   // process all elements with whole vector
{
  int x = dummy[i];
}
// handle remainder (hopefully with SIMD too)

But the resulting machine code has one more instruction than I would like (using gcc 4.9):

.L3:
        leaq    -16(%rax), %rdx
        addq    $16, %rax
        cmpq    %rcx, %rax
        movl    -120(%rsp,%rdx,4), %edx
        jbe     .L3

If I change the code to for (int64_t i = 0; i <= argc - 16; i+= 16), then the "extra" instruction is gone:

.L2:
        movl    -120(%rsp,%rax,4), %ecx
        addq    $16, %rax
        cmpq    %rdx, %rax
        jbe     .L2

But why the difference? At first I thought it might be due to loop invariants, but only vaguely. Then I noticed that in the five-instruction case the increment is done before the load, which requires an extra instruction (the leaq) to preserve the old index, because x86's destructive two-operand instructions would otherwise clobber it. So another explanation could be that the compiler is trading instruction-level parallelism for one extra instruction.

Although it seems there would hardly be any performance difference, can someone (preferably someone who knows about compiler transformations) explain this mystery?

Ideally I would like to keep the i + 16 <= size form, since it has a more intuitive meaning (the last element of the vector doesn't go out of bounds).


Solution

If argc were below -2147483632 (that is, below INT_MIN + 16) and i were below 2147483632, the expression i + 16 <= argc would still be required to yield an arithmetically correct result, while the expression i <= argc - 16 would not: argc - 16 is evaluated in int, and signed overflow is undefined behavior, so the compiler may assume it never happens. The need to give an arithmetically correct result in that corner case prevents the compiler from optimizing the former expression into the latter.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow