Question

I came across this excerpt today:

On most older microprocessors, bitwise operations are slightly faster than addition and subtraction operations and usually significantly faster than multiplication and division operations. On modern architectures, this is not the case: bitwise operations are generally the same speed as addition (though still faster than multiplication).

I'm curious about why bitwise operations were slightly faster than addition/subtraction operations on older microprocessors.

All I can think of that would cause the latency is that the circuits to implement addition/subtraction depend on several levels of logic gates (parallel adders and whatnot), whereas the bitwise operations have far simpler circuit implementations. Is this the reason?

I know arithmetic and bitwise operations both execute within one clock-cyle on modern processors, but speaking purely about propagation time for the circuit, is the latency still theoretically there in modern processors?

Finally, I had a conceptual C question about the execution of the bitwise shift operation:

unsigned x = 1;
x <<= 5;

unsigned y = 0;
y += 32;

Both x and y should hold the value 32, but did it take 5 separate left shifts to get x to that value (as in are bitwise shifts implemented via pipes)? To clarify, I'm asking purely about the circuit behavior not the number of clock cycles.

Was it helpful?

Solution

In any binary bitwise operation, each output bit depends only on the two corresponding bits in the inputs. In an add operation, each output bit depends on the corresponding bits in the inputs and all the bits to the right (toward lower values).

For example, the leftmost bit of 01111111 + 00000001 is 1, but the leftmost bit of 01111110 + 00000001 is 0.

In its simplest form, an adder adds the two low bits and produces one output bit and a carry. Then the next two lowest bits are added, and the carry is added in, producing another output bit and another carry. This repeats. So the highest output bit is at the end of a chain of adds. If you do the operation bit by bit, as older processors did, then it takes time to get to the end.

There are ways to speed this up some, by feeding several input bits into more complicated logic arrangements. But that of course requires more area in the chip and more power.

Today‘s processors have many different units for performing various sorts of work—loads, stores, addition, multiplication, floating-point operations, and more. Given today’s capabilities, the work of doing an add is small compared to other tasks, so it fits within a single processor cycle.

Perhaps in theory you could make a processor that did a bitwise operation more quickly than an add. (And there are, at least on paper, exotic processors that operate asynchronously, with different units doing work at their own paces.) However, with the designs in use, you need some regular fixed cycle to coordinate many things in the processor—loading instructions, dispatching them to execution units, sending results from execution units to registers, and much, much more. Some execution units do require multiple cycles to complete their jobs (e.g., some floating-point units take about four cycles to do a floating-point add). So you can have a mix. However, with current scales, making the cycle time smaller so that it fits a bitwise operation but not an add is likely not economical.

OTHER TIPS

The complicated thing about adding (you usually get subtracting for free) is that there is that pesky carry problem.

So, you end up with the naive solution being N times Full-Adders where N is how many bits wide your ALU is.

These pesky carries mean that you have a lot of propogation delay. And, because a single carry off can make the entire result inaccurate, you end up having to wait a fairly significant amount of time for all of the carry values and in-turn, all of the other full-adders down the chain to settle.

There are a lot of ways around this particular bottleneck, but none are as simple or resource-cheap to implement as a chain of full-adders. (the fastest being a lookup table implemented in silicon)

If you want more details, you'll probably need to ask this on http://electronics.stackexchange.com instead

To answer your last question, it depends. Some architectures only have shifts by 1 (such as z80), some architectures expose shifts by larger constants and/or variables but implement them internally as a bunch of "shift by 1"'s (such as old implementations of x86), there are some architectures that can shift by more than 1 in a single cycle but only if the shift amount is a constant, there are some architectures (such as modern implementations of x86) that use a barrel shifter and can shift by a variable in a single cycle, and there are still more possibilities.

The circuit depth of a barrel shifter is logarithmic in the maximum shift it can do, which is not necessarily the width of a register - it's sometimes one less than the width and it's conceivable for it to be even less.

Some addition implementations have to do an additional cycle for the carry bit. For example: an 16 bit integer require multiple instructions on a 8 bit processor. This also holds for the shift. But the shift can always shift the height bits to the lower bits of the next byte. The addition must add the lower bit in an additional round.

Bit wise operator executes on less time, because

  • processor take one instruction to perform bit wise operation and (let say) take one execution cycle, on the other hand other arithmetic instructions (specially, multiply and divide) take more execution cycles
  • Most of the time bit wise operation is performed with in one register, and other arithmetic instructions needed to handle more then one registers

That's why shifting bits are faster then other arithmetic operations

This I gleamed from an intro to assembly class. But shifting is just about the fastest instruction that a processor can execute. Adding and subtracting require a few instructions to execute. I imagine that modern processors are better optimized.

Presumably, someone can answer this more accurately and thoroughly.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top