Question

I wonder is it faster to replace branching with 2 multiplications or no (due to cache miss penalty)?
Here is my case:

float dot = rib1.x*-dir.y + rib1.y*dir.x;

if(dot<0){
    dir.x = -dir.x;
    dir.y = -dir.y;
}

And I'm trying to replace it with:

float dot = rib1.x*-dir.y + rib1.y*dir.x;

int sgn = (dot  < 0.0) - (0.0 < dot ); //returns -1 or 1 (no branching here, tested)
dir.x *= sgn;
dir.y *= sgn;
Was it helpful?

Solution 2

The cost of the multiplication depends on several factors, whether you use 32-bit or 64-bit floats, and whether you enable SSE or not. The cost of two float multiplications is 10 cycles according to this source: http://www.agner.org/optimize/instruction_tables.pdf

The cost of the branch also depends on several factors. As a rule of thumb, do not worry about branches in your code. The exact behaviour of the branch predictor on the CPU will define the performance, but in this case you should probably expect that the branch will be unpredictable at best, so this is likely to lead to a lot of branch mispredictions. The cost of a branch misprediction is 10-30 cycles according to this source: http://valgrind.org/docs/manual/cg-manual.html

The best advice anyone can give here is to profile and test. I would guess that on a modern Core i7 the two multiplications should be faster than the branch, if the range of input varies sufficiently as to cause sufficient branch mispredictions as to outweigh the cost of the additional multiplication.

Assuming 50% miss rate, the cost of the branch averages 15 cycles (30 * 0.5), the cost of the float mul is 10 cycles.


EDIT: Added links, updated estimated instruction cost.

OTHER TIPS

Branching does not imply cache miss: only instruction prefetching/pipelining is disturbed, so it's possible you block some SSE optimization at compile-time with it.

On the other side, if x86 instructions are being used only, the speculative execution will let the processor to properly start the execution of the most used branch.

On the other side, if you enter the if for the 50% of the times you are in the worst condition: in this case I'd try to look for SSE pipelining and to have the execution optimized with SSE, probably getting some hints from this post, in line with your second block of code.

However, benchmark your code, check the produced assembler in order to find the best solution for this optimization, and get the proper insight. And eventually keep us updated :)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top