Arithmetic shift using intel intrinsics

Question 1

I'm not sure why you are looking for intrinsics for this. Why not just use ordinary C++ right shift? This behavior is implementation defined but AFAIK on Intel platforms it will always sign extend.

int16_t val = 1 << 15; // 1000 0000 0000 0000 
int16_t shiftVal = val >> 15; // 1111 1111 1111 1111

Question 2

As mattnewport states in his answer, C code can do the job efficiently though with the possibility of 'implementation defined' behavior. This answer shows how to avoid the implementation defined behavior while preserving efficient code generation.

Because the question is about shifting a 16-bit operand, the concern over the implementation defined decision whether to sign extend or zero fill can be avoided by first sign extending the operand to 32-bits. Then the 32-bit value can be right shifted as unsigned, and finally truncated back to 16-bits.

Mattnewport's code actually sign extends the 16-bit operand to an int (32-bit or 64-bit depending on the compiler model) before shifting. This is because the language specification (C99 6.5.7 bitwise shift operators) requires a first step: integer promotions are performed on each of the operands. Likewise, the result of mattnewport's code is an int because The type of the result is that of the promoted left operand. Because of this, the code variation that avoids implementation defined behavior generates the same number of instructions as mattnewport's original code.

To avoid implementation defined behavior, the implicit promotion to signed int is replaced with an explicit promotion to unsigned int. This eliminates any possibility of implementation defined behavior while maintaining the same code efficiency.

The idea can be extended to cover 32-bit operands and works efficiently when 64-bit native integer support is present. Here is an example:

// use standard 'll' for long long print format
#define __USE_MINGW_ANSI_STDIO 1
#include <stdio.h>
#include <stdint.h>

// Code provided by mattnewport
int16_t aShiftRight16x (int16_t val, int count)
    {
    return val >> count;
    }

// This variation avoids implementation defined behavior
int16_t aShiftRight16y (int16_t val, int count)
    {
    uint32_t uintVal = val;
    uint32_t uintResult = uintVal >> count;
    return (int16_t) uintResult;
    }

// A 32-bit arithmetic right shift without implementation defined behavior
int32_t aShiftRight32 (int32_t val, int count)
    {
    uint64_t uint64Val = val;
    uint64_t uint64Result = uint64Val >> count;
    return (int32_t) uint64Result;
    }

int main (void)
    {
    int16_t val16 = 0x8000;
    int32_t val32 = 0x80000000;
    int count;

    for (count = 0; count <= 15; count++)
        printf ("%04hX %04hX %08X\n", aShiftRight16x (val16, count),
                                      aShiftRight16y (val16, count),
                                      aShiftRight32  (val32, count));
    return 0;
    }

Here is gcc 4.8.1 x64 code generation:

  0000000000000030 <aShiftRight16x>:
    30: 0f bf c1                movsx  eax,cx
    33: 89 d1                   mov    ecx,edx
    35: d3 f8                   sar    eax,cl
    37: c3                      ret    

  0000000000000040 <aShiftRight16y>:
    40: 0f bf c1                movsx  eax,cx
    43: 89 d1                   mov    ecx,edx
    45: d3 e8                   shr    eax,cl
    47: c3                      ret    

  0000000000000050 <aShiftRight32>:
    50: 48 63 c1                movsxd rax,ecx
    53: 89 d1                   mov    ecx,edx
    55: 48 d3 e8                shr    rax,cl
    58: c3                      ret

Here is MS visual studio x64 code generation:

  aShiftRight16x:
    00: 0F BF C1           movsx       eax,cx
    03: 8B CA              mov         ecx,edx
    05: D3 F8              sar         eax,cl
    07: C3                 ret
  aShiftRight16y:
    10: 0F BF C1           movsx       eax,cx
    13: 8B CA              mov         ecx,edx
    15: D3 E8              shr         eax,cl
    17: C3                 ret
  aShiftRight32:
    20: 48 63 C1           movsxd      rax,ecx
    23: 8B CA              mov         ecx,edx
    25: 48 D3 E8           shr         rax,cl
    28: C3                 ret

program output:

8000 8000 80000000
C000 C000 C0000000
E000 E000 E0000000
F000 F000 F0000000
F800 F800 F8000000
FC00 FC00 FC000000
FE00 FE00 FE000000
FF00 FF00 FF000000
FF80 FF80 FF800000
FFC0 FFC0 FFC00000
FFE0 FFE0 FFE00000
FFF0 FFF0 FFF00000
FFF8 FFF8 FFF80000
FFFC FFFC FFFC0000
FFFE FFFE FFFE0000
FFFF FFFF FFFF0000