Mixing inline assembly with C code - how to protect registers and minimize memory access

Question

Since you asked about the additional part, I'll focus on that. Looking at your first #if block:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  asm( "movq %[src], %%r13" :
       :
       [src] "r"  (buffer) );
  asm( "pcmpeqd %[src], %[dst]" :
       [dst] "=x" (val) :
       [src] "x" (val) );
  asm( "movdqa %[src], (%%r13)" : :
       [src] "x" (val) );
  asm( "movdqa %[src], 16(%%r13)" : :
       [src] "x" (val) );   
}

This fragment writes to r13 without telling the compiler about it. That is very bad, because gcc is free to be using r13 for something else at that point. Even if you had an asm("r13") on some local variable before this asm, it would still be wrong: you would have to list that local variable as an output here, then as an input on each subsequent asm. What's more, it is both confusing to maintainers and unnecessary.
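If a specific register really does have to be hard-coded, the minimum fix is to name it in the clobber list and to declare the memory that gets written. The following is only a sketch, assuming x86-64 and gcc-style extended asm. The function name store_all_ones_via_r13 is made up for illustration, __m128i is used instead of __m128d just to keep the types integer, and asm is spelled __asm__ so the code also builds in strict -std modes:

```c
#include <emmintrin.h>

/* Hypothetical helper: fills one 16-byte-aligned vector with all-ones bits,
   going through r13 on purpose to show how to declare that to gcc. */
static inline void store_all_ones_via_r13(__m128i *dst)
{
    __m128i val;

    __asm__("movq %[p], %%r13\n\t"      /* copy the pointer into r13      */
            "pcmpeqd %[v], %[v]\n\t"    /* v == v, so every bit becomes 1 */
            "movdqa %[v], (%%r13)"      /* aligned 16-byte store          */
            : [v] "=x" (val), "=m" (*dst)  /* what the asm writes         */
            : [p] "r" (dst)
            : "r13");                   /* the register we overwrite      */
}
```

With "r13" in the clobber list, gcc saves and restores the register where needed and will not pick it for the [p] operand.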

Also, splitting the work across multiple asm statements like this is a bad idea: gcc is not required to keep them in this order. That being the case, I'd suggest something more like this:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  asm("# val: %0" : "=x" (val)); /* fix "is used uninitialized" warning */

  asm( "pcmpeqd %[sval], %[dval]\n\t"
       "movdqa %[dval], %[buffer]\n\t"
       "movdqa %[dval], %[buffer1]" :

       [dval] "=x" (val), [buffer] "=m" (buffer[0]), [buffer1] "=m" (buffer[1]) :
       [sval] "x" (val) );
}

As for your #else block:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  asm( "pcmpeqd %[src], %[dst]" :
       [dst] "=x" (val) :
       [src] "x" (val) );
  asm( "movdqa %[src], %[dst]" :
       [dst] "=X" (buffer) :
       [src] "x" (val) );
  asm( "movdqa %[src], %[dst]" :
       [dst] "=X" (buffer+1) :
       [src] "x" (val) );
}

I would suggest:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  asm("# val: %0" : "=x" (val)); /* fix "is used uninitialized" warning */

  asm( "pcmpeqd %[sval], %[dval]\n\t"
       "movdqa %[dval], (%[sbuffer])\n\t"
       "movdqa %[dval], 16(%[sbuffer])" :

       [dval] "=x" (val), [buffer] "=m"  (buffer), [buffer1] "=m" (buffer[1]) :
       [sval] "x" (val), [sbuffer] "r"  (buffer));
}

There are a few things to note here.

I'm using the first asm statement to resolve a compiler warning about using val before it gets assigned. The warning is caused by using val as an input when it has never had a value assigned to it. Presumably in your real code you assign a reasonable value before using it.
By putting the 3 asm statements into one asm block, gcc can't move individual pieces around.
Why do I have sbuffer, buffer and buffer1, but never reference buffer and buffer1 in the template? sbuffer gets the pointer to the array into a register. buffer and buffer1 are listed as outputs because I must tell gcc that I am changing them. Using the "memory" clobber is easier, but it makes gcc assume that any memory may have changed, which can have serious performance implications. Alternatively, I could use some form of (from the gcc docs on extended asm):

{"m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )}.

This tells gcc that 10 chars starting at ptr will be accessed. Ugly, but it works if you know at compile time how many bytes of memory you're modifying. Point being, if you are changing any values in your asm (even entries in an array), you must let gcc know.
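The gcc docs also show an array-cast spelling of the same idea, which reads a little more easily. A sketch, assuming x86-64 and a 16-byte-aligned destination; fill_two_vectors is a made-up name, and __asm__ is used so it also builds in strict -std modes:

```c
#include <emmintrin.h>

/* Hypothetical helper: writes all-ones bits into dst[0] and dst[1].
   The single "=m" operand covers both vectors (32 bytes) at once. */
static inline void fill_two_vectors(__m128i *dst)
{
    __m128i ones;

    __asm__("pcmpeqd %[v], %[v]\n\t"
            "movdqa %[v], (%[p])\n\t"       /* dst[0] */
            "movdqa %[v], 16(%[p])"         /* dst[1] */
            : [v] "=x" (ones),
              "=m" (*(__m128i (*)[2]) dst)  /* "I write 2 vectors here" */
            : [p] "r" (dst));
}
```

The memory operand is never referenced in the template. It exists only so gcc knows exactly which bytes change, without resorting to the "memory" clobber.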

What else? Ahh yes, let's look at the asm (from -Os):

pcmpeqd %xmm0, %xmm0
movdqa %xmm0, (%rax)
movdqa %xmm0, 16(%rax)

As I understand it, the whole reason you were trying to use r13 is to avoid having the register clobbered when you call some subroutines that you don't control, wasting cycles reloading it each loop. So having this code use rax, well, that doesn't seem like a good idea, right? But wait! Watch what happens with this code:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  for (int x=0; x < 10; x++)
  {
    asm("# val: %0" : "=x" (val)); /* fix "is used uninitialized" */

    asm( "pcmpeqd %[src], %[dst]\n\t"
         "movdqa %[src], (%[sbuffer])\n\t" /* buffer[0] */
         "movdqa %[src], 16(%[sbuffer])" : /* buffer[1] */

         [dst] "=x" (val), [buffer] "=m"  (buffer), [buffer1] "=m" (buffer[1]) :
         [src] "x" (val), [sbuffer] "r"  (buffer));

    printf("%d\n", val);
  }
}

The asm is the same, but now we are in a loop and calling printf (a routine we don't control). What does the asm look like now? Here's the loop:

.L2:
    leaq    .LC0(%rip), %rcx
    movq    %rdi, %rdx
    pcmpeqd %xmm6, %xmm0
    movdqa  %xmm6, (%rbx)
    movdqa  %xmm6, 16(%rbx)
    movapd  %xmm0, 32(%rsp)
    call    printf
    subl    $1, %esi
    jne     .L2

Well, it has changed from rax to rbx. Is that better? Actually, it is. When you call subroutines in C, there are rules the compiler must follow (the ABI). These rules control things like where parameters get passed, where return values are located, who cleans up the stack, and (most importantly for our purposes here) which registers the subroutine must preserve (i.e. which must hold the same value when it returns). Wikipedia has some discussion and useful links about this. One thing of note is that rbx must be preserved (on x86-64).

As a result, if you look at the asm surrounding this code, you will notice that rbx only gets loaded once (outside the loop). Gcc knows that if any subroutines muck with rbx, they will put it back when they are done. What's more, since subroutines know they have to preserve rbx, they tend to avoid it unless the benefit of having one more register available is greater than the cost of saving/restoring it.
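To check this on your own compiler, a trimmed-down variant of the loop can be compiled with -S and inspected. This is a sketch only, assuming x86-64; opaque_callee is a made-up stand-in for "a routine we don't control", and which register gcc picks for the pointer is its choice, not something the code dictates:

```c
#include <emmintrin.h>

/* Hypothetical external routine; noinline so the call really happens. */
__attribute__((noinline)) int opaque_callee(int x)
{
    __asm__ volatile("" ::: "memory");  /* keep the body from folding away */
    return x + 1;
}

/* Stores all-ones into dst[0] and dst[1] on each pass, then makes a call.
   dst must be 16-byte aligned. */
int fill_in_loop(__m128i *dst, int iterations)
{
    int total = 0;

    for (int i = 0; i < iterations; i++) {
        __m128i ones;

        __asm__("pcmpeqd %[v], %[v]\n\t"
                "movdqa %[v], (%[p])\n\t"
                "movdqa %[v], 16(%[p])"
                : [v] "=x" (ones), "=m" (dst[0]), "=m" (dst[1])
                : [p] "r" (dst));   /* gcc chooses the register */

        total = opaque_callee(total);
    }
    return total;
}
```

Nothing here names a register, yet the pointer only needs to survive the call, which is exactly what a call-preserved register gives you for free.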

As for the whole idea of "reserving" a register and preventing any subroutine from using it, well, I won't say it's impossible (see Global Reg Vars and -ffixed-reg), but I will say that usually it's a terrible idea. Registers on x86 are a very useful and very limited resource. Trying to limit the number available will almost certainly cause more performance issues than it could ever fix.

There are two important takeaways here:

Trust the compiler. Letting it know you need a pointer to "buffer" in a register is usually sufficient. Gcc is (usually) smart enough to pick the best register for the task.
You MUST MUST MUST tell gcc everything that you are changing in the asm block, either by using clobbers or outputs (do not modify inputs). Failing to do this will lead to weird and hard-to-track-down problems.
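As a small illustration of the second point, here is how a value that the asm both reads and changes can be declared. A sketch, assuming x86-64; add_in_place is a made-up name:

```c
#include <stdint.h>

/* Hypothetical helper: x += y done in asm.
   x is read AND written, so it is a "+r" operand, not a plain input.
   addq also changes the flags, hence the "cc" clobber. */
static inline uint64_t add_in_place(uint64_t x, uint64_t y)
{
    __asm__("addq %[y], %[x]"
            : [x] "+r" (x)     /* read-write: gcc knows x changes */
            : [y] "r" (y)      /* pure input: left untouched      */
            : "cc");
    return x;
}
```

Had x been listed only as an input, gcc would be entitled to assume its register still holds the old value after the asm.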

Ok, a lot of detail here (probably way more than you need). Hopefully the answers you seek are here as well.