Fastest way to split a word into two bytes

Answer 1

I'm 99.9% sure the first one is at least as fast as the second on nearly all architectures. On some architectures there may be no difference, and on several the second will be slower.

The main reason I'd expect the second to be slower is that it needs two shifts to compute c2. The processor can't start the second shift until the first one has finished, so the two form a dependency chain.

The compiler may also be able to do other clever things with the first form, if the instruction set allows it. For example, on x86 it can load s into AX, then store AL into c1 and AH into c2, with no extra instructions beyond the stores. The second form is much less likely to be recognized as a "known common pattern". I have certainly never seen that variant in real code, whereas the shift/and method is very common, often in "pixel loops", which makes it critical for compilers to optimise it well.
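The question's code is not reproduced here, so the following is a minimal sketch under an assumption: the first form takes the low byte with a mask, and the second takes it with a left shift followed by a right shift. The function names are hypothetical. Both forms give identical results; the second just needs two dependent operations for that byte. Whether your compiler actually emits different code for them is worth checking in its assembly output.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical form 1: one shift for the high byte, one AND mask for the low byte.
inline void split_mask(std::uint16_t s, std::uint8_t& hi, std::uint8_t& lo) {
    hi = static_cast<std::uint8_t>(s >> 8);
    lo = static_cast<std::uint8_t>(s & 0xFF);
}

// Hypothetical form 2: the low byte needs two dependent shifts.
// s is promoted to int before shifting, so the intermediate result is cast
// back to 16 bits to drop the old high byte before shifting right again.
inline void split_two_shifts(std::uint16_t s, std::uint8_t& hi, std::uint8_t& lo) {
    hi = static_cast<std::uint8_t>(s >> 8);
    lo = static_cast<std::uint8_t>(static_cast<std::uint16_t>(s << 8) >> 8);
}

// Exhaustively checks that both forms agree for every 16-bit input.
inline void check_forms_agree() {
    for (std::uint32_t v = 0; v <= 0xFFFF; ++v) {
        std::uint8_t h1, l1, h2, l2;
        split_mask(static_cast<std::uint16_t>(v), h1, l1);
        split_two_shifts(static_cast<std::uint16_t>(v), h2, l2);
        assert(h1 == h2 && l1 == l2);
    }
}
```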

As always: measure, measure, and measure again. And unless you care ONLY about performance on your particular machine, try it on different processor models and manufacturers, so you don't end up with something that is 5% faster on your machine but 20% slower on another.

Answer 2

Let the compiler do this work for you. Use a union, so the bytes are split without any hand-made bit shifts. For example:

#include <cstdint>

union U {
  std::uint16_t s;        // the 16-bit word
  struct Byte {
    std::uint8_t c1, c2;  // its two bytes, in memory order
  } byte;
};

Usage is simple:

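#include <iostream>
U u;
u.s = 0x3210;
// Cast to int so each byte prints as a number rather than as a raw character
std::cout << std::hex << int(u.byte.c1) << " and " << int(u.byte.c2) << '\n';
// On a little-endian machine such as x86 this prints "10 and 32"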

The concept is simple; from there you can overload operators to make it more convenient if you want.

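Note that the order of c1 and c2 depends on the target's byte order (endianness), not on the compiler as such: on a little-endian machine such as x86, c1 holds the low-order byte. That order is known at compile time, so you can use conditional macros to make sure the order matches your needs on any platform. Also be aware that in C++ (unlike C), reading a union member other than the one most recently written is formally undefined behavior, although GCC documents this kind of type punning as supported.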
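One way to get the same split without reading an inactive union member, and without hard-coding the byte order, is to copy the word into a byte array with std::memcpy and check the host's byte order. The sketch below does the check at run time (compilers typically fold it to a constant); in C++20, std::endian from <bit> can do it at compile time instead. The names Bytes, host_is_little_endian and split_via_memcpy are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True if the host stores the low-order byte first (e.g. x86).
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

struct Bytes {
    std::uint8_t hi;  // high-order byte
    std::uint8_t lo;  // low-order byte
};

// Reads the bytes in memory order, then reorders them so that
// hi/lo mean the same thing regardless of the host's byte order.
inline Bytes split_via_memcpy(std::uint16_t s) {
    unsigned char raw[sizeof s];
    std::memcpy(raw, &s, sizeof s);
    if (host_is_little_endian()) {
        return Bytes{raw[1], raw[0]};
    }
    return Bytes{raw[0], raw[1]};
}
```

The reordering is only needed because the bytes are read from memory; the shift and mask forms are already independent of byte order.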

Answer 3

You should certainly use a cast rather than either a mask or two shifts to extract the low-order byte. The compiler will then do whatever it already knows to be fastest. That leaves you with the high-order byte, for which there is only one choice: a single right shift by 8.
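A minimal sketch of the cast approach, assuming unsigned 16-bit input (the function names are hypothetical). Converting to an 8-bit unsigned type is defined to keep the value modulo 256, so it yields the low byte without an explicit mask.

```cpp
#include <cassert>
#include <cstdint>

// Low byte: the conversion to uint8_t keeps only the low 8 bits.
constexpr std::uint8_t low_byte(std::uint16_t s) {
    return static_cast<std::uint8_t>(s);
}

// High byte: a single right shift; the result already fits in 8 bits.
constexpr std::uint8_t high_byte(std::uint16_t s) {
    return static_cast<std::uint8_t>(s >> 8);
}

// Compile-time self-checks.
static_assert(low_byte(0x3210) == 0x10, "low byte of 0x3210");
static_assert(high_byte(0x3210) == 0x32, "high byte of 0x3210");
```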

Answer 4

You should time each of them in a loop such as for (long i = 0; i < 100000000; i++). I did, and the first one was faster (0.82 s versus 0.84 s). An easy way to do this in Microsoft Visual Studio is to set a watch on the @clk pseudo-variable.
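A portable alternative to the @clk watch is std::chrono. The harness below is a sketch, not a rigorous benchmark. It assumes the two variants differ only in how the low byte is taken (mask versus two shifts), and all names in it are hypothetical. A modern optimiser may compile both variants to identical code, in which case any difference you see is noise.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical variant A: mask for the low byte.
inline std::uint8_t low_by_mask(std::uint16_t s) {
    return static_cast<std::uint8_t>(s & 0xFF);
}

// Hypothetical variant B: two dependent shifts for the low byte.
inline std::uint8_t low_by_two_shifts(std::uint16_t s) {
    return static_cast<std::uint8_t>(static_cast<std::uint16_t>(s << 8) >> 8);
}

// Runs f over `iterations` inputs and returns the elapsed time in seconds.
// The running sum is written to a volatile so the loop is not optimised away;
// unsigned overflow of the sum simply wraps, which is well defined.
template <typename F>
double time_split(F f, long iterations, std::uint32_t& checksum) {
    volatile std::uint32_t sink = 0;
    std::uint32_t acc = 0;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; i++) {
        std::uint16_t s = static_cast<std::uint16_t>(i);
        acc += static_cast<std::uint8_t>(s >> 8);  // high byte, same in both variants
        acc += f(s);                               // low byte, variant under test
    }
    sink = acc;
    auto stop = std::chrono::steady_clock::now();
    checksum = sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_split(low_by_mask, 100000000L, sum) and time_split(low_by_two_shifts, 100000000L, sum) in an optimised build reproduces the loop described above; matching checksums confirm both variants computed the same bytes.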
