Why is this C++ string length calculation function faster than this other one?

Question 1

Ignoring the optimization for a moment and just looking at the paper algorithms:

The former performs this calculation repeatedly:

addr++

with a result calculated by a difference calculation

addr1 - addr0

the latter performs these calculations repeated

addr0 + i
i++

with the result being calculated by a value-return

In other words, twice as much work is being done in the loop for the benefit of doing half as much in calculating the final result.

getting to the optimized ASM, the first generates this at -O3 on my clang:

0x100000ee4:  cmpb   $0, 1(%rbx)
0x100000ee8:  leaq   1(%rbx), %rbx
0x100000eec:  sete   %al
0x100000eef:  testb  $1, %al
0x100000ef1:  je     0x100000ee4

the second generates this:

0x100000f09:  incl   %ebx
0x100000f0b:  cmpb   $0, (%rax)
0x100000f0e:  leaq   1(%rax), %rax
0x100000f12:  sete   %cl
0x100000f15:  testb  $1, %cl
0x100000f18:  je     0x100000f09

I left out the constant one-timers for the return values because they are not core to the complexity of the loop. The optimizer is pretty good, noting the only major difference is that single:

0x100000f09:  incl   %ebx

which is your i

Question 2

This is a micro-optimization, and a modern compiler would likely end up generating the same assembly for both, but for a non-optimized version, here is why:

int strlen_1(const char *str) 
{
    const char *temp = str; // declare the iterator
    while(*temp != '\0')   // dereference the pointer
                           // test the iterator 
    {
        temp++; // increment the iterator
    }
    return temp - str; // pointer subtraction
}

For a string of length N, this gives you 3N + 2 operations.

int strlen_03(const char *str) 
{
    int i; // declare your iterator
    for (i = 0; *(str+i) != '\0'; i++); // initialize the iterator
                                        // add i to str
                                        // dereference that pointer value
                                        // test it against \0
                                        // increment i
    return i;
}

For the same string, this gives you 4N + 2 operations.

Again, a modern compiler will likely fix this for you, and this small loop isn't likely to make much of a difference even in the un-optimized form for most strings (only for very long strings).