Integer arithmetic is performed in hardware, typically in a very small number of clock cycles.
You will not be able to get close to this performance in software. Your implementation using bitwise operations involves a function call and a loop. Each bitwise operation that you perform typically costs about as many clock cycles as an arithmetic operation.
You are performing three bitwise operations per iteration. Frankly, I'm astonished that there is only a factor of 10 here.
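To make the cost concrete, here is a minimal sketch of what such a bitwise adder usually looks like. The name `bitwise_add` and the use of `uint32_t` are my assumptions, and your code may well differ in the details:

```c
#include <stdint.h>

/* Hypothetical reconstruction of a bitwise adder; not the code from the question. */
uint32_t bitwise_add(uint32_t a, uint32_t b)
{
    /* Repeat until there are no carries left to propagate. */
    while (b != 0) {
        uint32_t carry = a & b;  /* AND: positions where both bits are set produce a carry */
        a = a ^ b;               /* XOR: sum of the bits, ignoring carries */
        b = carry << 1;          /* shift: each carry moves one position to the left */
    }
    return a;
}
```

Each pass through the loop executes an AND, an XOR and a shift, plus the loop test and branch, and the loop can run once per bit of carry propagation in the worst case. The hardware does the whole addition in a single `add` instruction.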
I also wonder what your compiler settings are, specifically whether optimisations are enabled. A good compiler could eliminate your while loop in the arithmetic version entirely. For performance comparisons you should be comparing optimised code, and it looks as if you might not be doing so.
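As an illustration of the kind of loop an optimiser can remove, consider this sketch. The function `add_in_loop` is hypothetical and only stands in for whatever your arithmetic benchmark does:

```c
#include <stdint.h>

/* Hypothetical benchmark-style loop; not the code from the question. */
uint32_t add_in_loop(uint32_t start, uint32_t n)
{
    uint32_t sum = start;
    uint32_t i = 0;
    while (i < n) {
        sum = sum + 1;  /* one hardware add per iteration, if the loop survives */
        i++;
    }
    /* The result is always start + n (modulo 2^32), so an optimising
       compiler may compute that directly and drop the loop. */
    return sum;
}
```

Whether this actually happens depends on your compiler and its flags, so inspect the generated assembly before trusting a timing. If the loop has been folded away in one version but not the other, the comparison tells you nothing about the cost of the operations themselves.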
It's difficult to know what you are trying to achieve here, but do not expect to beat the performance of hardware arithmetic units.