Clang/GCC Compiler Intrinsics without corresponding compiler flag

Question 1

Here is an approach using gcc that might be acceptable. All source code goes into a single source file, which is divided into sections. One section is compiled according to the command-line options; functions such as main() and the processor feature detection go here. Another section is compiled according to a target override pragma, and the intrinsic functions supported by the override target can be used there. Functions in an override section should be called only after processor feature detection has confirmed that the required processor features are present. This example has a single override section for AVX2 code; multiple override sections can be used when writing functions optimized for multiple targets.

// temporarily switch target so that the intrinsic declarations up to AVX2 are visible
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
#include <immintrin.h>   // GCC's intrinsics header; <intrin.h> is only present on MinGW/MSVC
// restore the target selection
#pragma GCC pop_options

//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
//----------------------------------------------------------------------------

int dummy1 (int a) {return a;}

//----------------------------------------------------------------------------
// the following functions will be compiled using core-avx2 code generation
// intrinsic functions up to AVX2 are available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
//----------------------------------------------------------------------------

static __m256i bitShiftLeft256ymm (__m256i *data, int count)
   {
   __m256i innerCarry, carryOut, rotate;

   innerCarry = _mm256_srli_epi64 (*data, 64 - count);                        // carry-out bits moved to the low end of each qword
   rotate     = _mm256_permute4x64_epi64 (innerCarry, 0x93);                  // rotate ymm left 64 bits
   innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC);   // clear lower qword
   *data      = _mm256_slli_epi64 (*data, count);                             // shift all qwords left
   *data      = _mm256_or_si256 (*data, innerCarry);                          // propagate carries from lower qwords
   carryOut   = _mm256_xor_si256 (innerCarry, rotate);                        // clear all except lower qword
   return carryOut;
   }

//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
#pragma GCC pop_options
//----------------------------------------------------------------------------

int main (void)
    {
    return 0;
    }

//----------------------------------------------------------------------------
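The "call only after feature detection" step can be sketched as follows. This is a minimal illustration, not part of the code above: it assumes a reasonably recent GCC or Clang on x86, it uses the per-function target attribute in place of the pragma section, and the names add_i32, add_i32_scalar and add_i32_avx2 are made up for the example. The check relies on the __builtin_cpu_supports builtin that GCC and Clang provide.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Baseline version: compiled with whatever the command line selects. */
static void add_i32_scalar(const int32_t *a, const int32_t *b,
                           int32_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

#if HAVE_X86_DISPATCH
/* AVX2 version: the target attribute plays the role of the override
   section, so AVX2 intrinsics are usable in this function only. */
static __attribute__((target("avx2")))
void add_i32_avx2(const int32_t *a, const int32_t *b, int32_t *out, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
    for (; i < n; i++)          /* leftover elements */
        out[i] = a[i] + b[i];
}
#endif

/* Hypothetical dispatcher: takes the AVX2 path only when the running
   CPU reports AVX2 support, otherwise falls back to the baseline. */
static void add_i32(const int32_t *a, const int32_t *b, int32_t *out, int n)
{
#if HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2")) {
        add_i32_avx2(a, b, out, n);
        return;
    }
#endif
    add_i32_scalar(a, b, out, n);
}
```

Callers only ever use add_i32(), so the AVX2 code is never reached on a processor that lacks it, and the file still builds without -mavx2 on the command line.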

Question 2

There is no way to control the instruction set the compiler uses, other than the switches on the compiler itself. In other words, there are no pragmas or similar features for this, just the overall compiler flags.

This means that the only viable solution for achieving what you want is to use the -msseX options and split your source into multiple files. (Of course, you can always use clever #include tricks to keep one single text file as the main source and include the same file in multiple places.)

Of course, the source code of the compiler is available. I'm sure the maintainers of GCC and Clang/LLVM will happily take patches that improve on this. But bear in mind that the path from "parsing the source" to "emitting instructions" is quite long and complicated. What should happen if we do this:

#pragma use_sse=1
void func()
{
   ... some code goes here ... 
}

#pragma use_sse=3
void func2()
{
  ...
  func();
  ...
}

Now, func is short enough to be inlined. Should the compiler inline it? If so, should it use SSE1 or SSE3 instructions for func()?

I understand that YOU may not care about that sort of difficulty, but the maintainers of Clang and GCC will indeed have to deal with this in some way.

Edit: In the header files that declare the SSE intrinsics (and many other intrinsics), a typical function looks something like this:

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ss (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}

The builtin __builtin_ia32_addss is only available in the compiler when the -msse option is enabled. So if you convince the compiler to still let you use _mm_add_ss() with -mno-sse, it gives you an error along the lines of "__builtin_ia32_addss is not declared in this scope" (I just tried).

It would probably not be very hard to change this particular behaviour; there are probably only a few places where the code introduces the builtin functions. However, I'm not convinced that there aren't further issues later on, when the compiler actually comes to emit instructions.

I have done some work with "builtin functions" in a Clang-based compiler, and unfortunately, there are several steps involved in getting from the "parser" to the "code generation", where the builtin function gets involved.

Edit2:

Compared to GCC, solving this for Clang is even more complex, because the compiler itself has an understanding of SSE instructions, so the header file simply contains this:

static __inline__ __m128 __attribute__((__always_inline__, __nodebug__))
_mm_add_ps(__m128 __a, __m128 __b)
{
  return __a + __b;
}

The compiler then knows that to add a couple of __m128 values it needs to produce the correct SSE instruction. I have just downloaded Clang. (I'm at home; my work on Clang is at work and is not related to SSE at all, just builtin functions in general. I haven't made many changes to Clang as such, but it was enough to understand roughly how builtin functions work.)

However, from your perspective, the fact that it's not a builtin function makes things worse, because the operator+ translation is much more complicated. I'm pretty sure the compiler just turns it into an "add these two things" operation and passes it to LLVM for further work; LLVM is the part that understands SSE instructions. The fact that this was an "intrinsic function" is then pretty much lost, and the compiler treats it just as if you had written a + b, with a and b happening to be 128-bit types. That makes it even more complicated to generate "the right instructions" for this code while keeping "all other" instructions at a different SSE level.