Ignoring the optimization for a moment and just looking at the paper algorithms:
The former performs this calculation repeatedly:
addr++
with a result calculated by a difference calculation
addr1 - addr0
the latter performs these calculations repeated
addr0 + i
i++
with the result being calculated by a value-return
i
In other words, twice as much work is being done in the loop for the benefit of doing half as much in calculating the final result.
getting to the optimized ASM, the first generates this at -O3 on my clang:
0x100000ee4: cmpb $0, 1(%rbx)
0x100000ee8: leaq 1(%rbx), %rbx
0x100000eec: sete %al
0x100000eef: testb $1, %al
0x100000ef1: je 0x100000ee4
the second generates this:
0x100000f09: incl %ebx
0x100000f0b: cmpb $0, (%rax)
0x100000f0e: leaq 1(%rax), %rax
0x100000f12: sete %cl
0x100000f15: testb $1, %cl
0x100000f18: je 0x100000f09
I left out the constant one-timers for the return values because they are not core to the complexity of the loop. The optimizer is pretty good, noting the only major difference is that single:
0x100000f09: incl %ebx
which is your i