Question

This is a great article about low-level optimization techniques; it shows an example where the author converts expensive divisions into cheap comparisons. https://www.facebook.com/notes/facebook-engineering/three-optimization-tips-for-c/10151361643253920

For those who don't want to click, essentially he converted this:

uint32_t digits10(uint64_t v) {
    uint32_t result = 0;
    do {
        ++result;
        v /= 10;
    } while (v);
    return result;
}

Into this:

uint32_t digits10(uint64_t v) {
  uint32_t result = 1;
  for (;;) {
    if (v < 10) return result;
    if (v < 100) return result + 1;
    if (v < 1000) return result + 2;
    if (v < 10000) return result + 3;
    // Skip ahead by 4 orders of magnitude
    v /= 10000U;
    result += 4;
  }
}

Resulting in up to a 6x speedup.

While comparisons are very cheap, I've always heard that branches are very expensive because they can cause pipeline stalls. Because of the conventional wisdom about branching, I never would have considered an approach like this.

Why is branching not a bottleneck in this case? Is it because we return right after each of the comparisons? Is it because the code size here is small, so there is not much for the processor to mispredict? In what cases would branching become a bottleneck and start to dominate the cost of the divisions? The author never addresses this.

Can anyone resolve the apparent tension between cheap comparisons and expensive branches? Of course, the golden rule of optimization is that one must always measure. However, it would at least be good to have some intuition about this issue, so that one could use comparisons intelligently when trying to come up with new approaches to making code faster.
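
On the measurement front, a minimal harness along these lines is the kind of thing I have in mind (the names digits10_div/digits10_cmp, the input distribution, and the volatile sink are my own choices, just to sketch a fair comparison):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// The divide-by-10 version from the article.
uint32_t digits10_div(uint64_t v) {
    uint32_t result = 0;
    do {
        ++result;
        v /= 10;
    } while (v);
    return result;
}

// The comparison version from the article.
uint32_t digits10_cmp(uint64_t v) {
    uint32_t result = 1;
    for (;;) {
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;
        result += 4;
    }
}

template <typename F>
long long time_ns(F f, const std::vector<uint64_t>& inputs) {
    volatile uint64_t sink = 0;  // keeps the results "used" so the loop isn't optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (uint64_t v : inputs) sink = sink + f(v);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    std::mt19937_64 rng(12345);        // fixed seed for repeatable runs
    std::vector<uint64_t> inputs(1'000'000);
    for (auto& v : inputs) v = rng();  // uniform over all 64-bit values

    std::printf("div: %lld ns\n", time_ns(digits10_div, inputs));
    std::printf("cmp: %lld ns\n", time_ns(digits10_cmp, inputs));
}

(The input distribution matters a lot here: uniformly random 64-bit values are almost all 19-20 digits long, so a realistic mix of small numbers would need a different generator.)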

Thanks!

Solution

Branches aren't necessarily expensive -- it's really mispredicted branches that are expensive¹.

So, let's start with the loop. It's infinite, so it's always taken. Since it's always taken, it's also always predicted as taken, so it's cheap.

Only one other branch is ever taken for any given input. That is to say, the tests run one after another, and until you reach the one that matches the magnitude of the input number, none of the branches is taken (i.e., each if condition is false).

Assuming (for example) a random mix of input numbers with a maximum of, say, 16 digits, one of the four branches is taken in only about one out of every 4 iterations of the loop -- on average, only about one test in 16 -- and a decent branch predictor will probably predict them all as not-taken nearly all the time. The result is that we probably end up with exactly one mispredicted branch in the entire computation.
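
To make that concrete, here is an instrumented sketch (mine, not from the article) that counts how many comparisons are evaluated versus how many actually fire for a sample 16-digit input:

#include <cstdint>
#include <cstdio>

// Instrumented copy of the comparison loop: every test is counted, and
// by construction exactly one of them fires (is taken) per call.
uint32_t digits10_traced(uint64_t v, int& tests) {
    uint32_t result = 1;
    tests = 0;
    for (;;) {
        ++tests; if (v < 10)    return result;      // the only branch ever taken...
        ++tests; if (v < 100)   return result + 1;  // ...is whichever of these four
        ++tests; if (v < 1000)  return result + 2;  // matches the magnitude on the
        ++tests; if (v < 10000) return result + 3;  // final iteration
        v /= 10000U;
        result += 4;
    }
}

int main() {
    int tests = 0;
    uint32_t d = digits10_traced(1234567890123456ULL, tests);  // a 16-digit number
    std::printf("digits=%u, tests=%d, taken=1\n", d, tests);   // digits=16, tests=16
}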

As a rule of thumb, figure that a correctly predicted branch takes around 1 clock and a mispredicted branch takes around 20-30 clocks. So, for a 16-digit number we end up with something like 15 not-taken comparisons + 4 loop branches = 19 correctly predicted branches, plus 1 mispredicted branch, for a total of roughly 39-49 clocks. For, say, a 2-digit number, we end up at around 1 + 20 = 21 clocks.

The obvious alternative would be to divide by 10 on every iteration until the quotient reaches zero, as in the first version. Division is relatively expensive -- for example, a 64-bit division can take around 26-86 cycles on an i7. For the sake of simplicity, let's assume an average of 40. So, for a 16-digit number we can expect around 16 * 40 = ~640 clocks for the divisions alone. Even in the best case, a 2-digit number requires two divisions at only 26 clocks each, so it ends up at 52 clocks total.
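
Plugging those per-operation estimates into one back-of-the-envelope model makes the gap easy to see (the 1/20/40-clock figures are the assumptions from above, not measurements, and the branch accounting is deliberately rough):

#include <cstdio>

int main() {
    // Cost model from the text above (assumptions, not measurements):
    // ~1 clock per correctly predicted branch, ~20 per mispredict,
    // ~40 clocks per 64-bit division.
    const int kPredicted = 1, kMispredict = 20, kDivide = 40;

    const int cases[] = {2, 16};  // the digit counts worked through above
    for (int digits : cases) {
        int iterations = (digits + 3) / 4;   // the loop advances 4 digits at a time
        int notTaken   = digits - 1;         // comparisons that fail before the hit
        int loopJumps  = iterations - 1;     // backward jumps actually executed
        int cmpCost    = (notTaken + loopJumps) * kPredicted + kMispredict;
        int divCost    = digits * kDivide;   // one division per digit in the first version
        std::printf("%2d digits: comparisons ~%3d clocks, divisions ~%3d clocks\n",
                    digits, cmpCost, divCost);
    }
    // Prints roughly 21 vs 80 clocks for 2 digits and 38 vs 640 for 16 digits,
    // in line with the ballpark figures above (the 52-clock division figure in
    // the text uses the best-case 26 clocks per division instead of 40).
}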

Bottom line: even very close to its best case, division still ends up slower than nearly the worst case for comparisons. Most of the comparisons are predicted correctly, so we typically pay for only one expensive (mispredicted) branch.


¹ This, of course, is assuming a modern, relatively high-end processor. On a really old processor (or a low-end embedded processor) you probably don't have a branch predictor at all, so all branches tend to be quite expensive. At the same time, such a processor may not have a divide instruction at all, and if it does, it's probably quite slow. In short, both branches and division take a lot more clocks than on a modern processor, but a branch is still usually quite a bit faster than a division.

OTHER TIPS

The first implementation actually branches more, even though it has only one branch point: that single branch executes once per digit.

That said, as a matter of preference in coding style, I would use the first implementation. A collection of similar branches might well perform better, but it's still more code, and it looks like it was written thoughtlessly (as a matter of fact, why does it still keep result around?). And what if I want more than five digits? :|

The algorithm is mostly comparisons. The only explicit branch is when returning.

The gain is mostly from avoiding a costly division per digit, each of which could take over 100 clock cycles. One could make a case that, since the maximum uint64_t value has 20 decimal digits, fully unrolling the loop into a chain of 19 comparisons would have been the most efficient way, as sketched below.
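
As a sketch of that idea (my own unrolling, not code from the article), the whole function becomes a chain of at most 19 comparisons with no division at all; values with 20 digits simply fall through every test:

#include <cstdint>

// Fully unrolled digits10: UINT64_MAX is 18446744073709551615 (20 digits),
// so 19 ascending threshold tests cover every case.
uint32_t digits10_unrolled(uint64_t v) {
    if (v < 10ULL) return 1;
    if (v < 100ULL) return 2;
    if (v < 1000ULL) return 3;
    if (v < 10000ULL) return 4;
    if (v < 100000ULL) return 5;
    if (v < 1000000ULL) return 6;
    if (v < 10000000ULL) return 7;
    if (v < 100000000ULL) return 8;
    if (v < 1000000000ULL) return 9;
    if (v < 10000000000ULL) return 10;
    if (v < 100000000000ULL) return 11;
    if (v < 1000000000000ULL) return 12;
    if (v < 10000000000000ULL) return 13;
    if (v < 100000000000000ULL) return 14;
    if (v < 1000000000000000ULL) return 15;
    if (v < 10000000000000000ULL) return 16;
    if (v < 100000000000000000ULL) return 17;
    if (v < 1000000000000000000ULL) return 18;
    if (v < 10000000000000000000ULL) return 19;
    return 20;
}

A binary search over the same thresholds would cap it at 5 comparisons per call, but the linear chain above is the direct transliteration of the suggestion.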
