Question

I have a routine that I would like to write mostly in assembly, but I need to call C functions to get some data that I need for processing. In some cases, I can pre-digest the data and load a register with a pointer to it, but in other cases, I have to call the full function because the possible data set is too large. These functions cannot be modified because they are someone else's code, and their interfaces need to remain the same for other pieces of code. Some of them reside in shared libraries, though some are inlined functions through header files (which I can't change).

I can assign local variables to registers using the asm construct:

register int myReg asm( "%r13" );

I'm afraid that if I then directly manipulate %r13 in assembly, call a C function, and return, it will need to be refreshed from memory, or worse yet just be completely overwritten. For certain ABIs it's also not safe for me to push/pop the registers directly myself, correct? I'm working in x86-64 on Linux.

What I'm doing right now seems to be working with -g -O0, but I'm afraid that when I turn the optimizations on for the C code, it will start touching registers that I was hoping would be protected.

In general my code flow looks like:

asm( "movq %[src], %%r13" : : [src] "m" (myVariablePointer) : "r13" );

localVariable1 = callSomeCfunction( stuff );

storageLocation = index * sizeof( longTermStorageItem );
longTermStorage[storageLocation] = localVariable1;
// some intermediate registers need to be used here for dereferences and math

switch ( localVariable1 )
{
   case CONSTANT_VAL_A:
     C_MACRO_LOOP_CONSTRUCT
     {
       asm( "movdqa (%r13), %xmm0\n"
            // ... do some more stuff
     } C_MACRO_LOOP_CONSTRUCT_ENDING
   break;
   case CONSTANT_VAL_B:
     // ... and so forth
}

The "C_MACRO_LOOP_CONSTRUCT" things are #defines from a foreign header file with "for" loops that need to dereference some pointers and whatnot in the process, and store the iterator in a local variable.

So my concern is how to ensure that %r13 is preserved across all of this stuff. So far the compiler hasn't touched it, but I'm sure that's more by luck than by design. And preservation of the value itself isn't my only concern. I want it to remain in the register where I put it if at all possible. Moving it out to local/stack storage and back frequently will kill my performance.

Is there a way I can better protect a small subset of registers from the compiler/optimizer?

ADDITIONAL INFORMATION

Here's why I want to do this. Look at the code below:

#include <emmintrin.h>
#include <stdio.h>

__m128d buffer[100];   

int main( void )
{
  unsigned long long *valPtr;

  register __m128d val;
  register __m128d *regPtr;
#ifdef FORCED  
  asm( "movq %[src], %%r13" :
       :
       [src] "r"  (buffer) );
  asm( "pcmpeqd %[src], %[dst]" :
       [dst] "=x" (val) :
       [src] "x" (val) );
  asm( "movdqa %[src], (%%r13)" : :
       [src] "x" (val) );
  asm( "movdqa %[src], 16(%%r13)" : :
       [src] "x" (val) );   
#else
  asm( "pcmpeqd %[src], %[dst]" :
       [dst] "=x" (val) :
       [src] "x" (val) );
  asm( "movdqa %[src], %[dst]" :
       [dst] "=X" (buffer) :
       [src] "x" (val) );
  asm( "movdqa %[src], %[dst]" :
       [dst] "=X" (buffer+1) :
       [src] "x" (val) );
#endif

  valPtr = (unsigned long long *)buffer;
  printf( "OUTPUT: [0] %016llx%016llx, [1] %016llx%016llx\n",
   valPtr[0], valPtr[1], valPtr[2], valPtr[3] );

  return 0;
}

If I compile this with "FORCED" defined, it builds and it works. But this is scary because the compiler is not protecting "%r13" in this case (it could be any register, doesn't matter). But by using a hard-coded register, I can use the indexed addressing mode, namely 16(%%r13). This saves me the extra instruction to increment the value and lets me store to the new location all in one step.

If I try to compile without "FORCED", gcc reports:

y.c: In function ‘main’:
y.c:32: error: invalid lvalue in asm statement
y.c:30: error: invalid lvalue in asm output 0

So I guess my question should become, can I use the indexed addressing mode with an appropriate constraint? I tried "m", "X", and "o". No difference. If I try to pull the offset into the assembly and out of the parameter like this:

asm( "movdqa %[src], 16(%[dst])" :
 [dst] "=m" (buffer) :
 [src] "x" (val) );

GCC responds with:

/tmp/ccoNwyco.s: Assembler messages:
/tmp/ccoNwyco.s:28: Error: junk `(buffer(%rip))' after expression

Any idea how to use this addressing mode and eliminate the unnecessary instruction?

Solution

Since you asked about the additional part, I'll focus on that. Looking at your first #if block:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  asm( "movq %[src], %%r13" :
       :
       [src] "r"  (buffer) );
  asm( "pcmpeqd %[src], %[dst]" :
       [dst] "=x" (val) :
       [src] "x" (val) );
  asm( "movdqa %[src], (%%r13)" : :
       [src] "x" (val) );
  asm( "movdqa %[src], 16(%%r13)" : :
       [src] "x" (val) );   
}

This fragment writes to r13, without telling the compiler about it. That is very bad. Even if you had an asm("r13") on some local variable before calling this asm, this would be bad. You would still have to list that local variable as an output, then an input on the subsequent asms. What's more, it's both confusing to maintainers, and unnecessary.

Also, having multiple asm statements like this is a bad idea. gcc is not guaranteed to keep them in this order. That being the case, I'd suggest something more like this:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  asm("# val: %0" : "=x" (val)); /* fix "is used uninitialized" warning */

  asm( "pcmpeqd %[sval], %[dval]\n\t"
       "movdqa %[dval], %[buffer]\n\t"
       "movdqa %[dval], %[buffer1]" :

       [dval] "=x" (val), [buffer] "=m" (buffer[0]), [buffer1] "=m" (buffer[1]) :
       [sval] "x" (val) );
}

As for your #else block:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  asm( "pcmpeqd %[src], %[dst]" :
       [dst] "=x" (val) :
       [src] "x" (val) );
  asm( "movdqa %[src], %[dst]" :
       [dst] "=X" (buffer) :
       [src] "x" (val) );
  asm( "movdqa %[src], %[dst]" :
       [dst] "=X" (buffer+1) :
       [src] "x" (val) );
}

I would suggest:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  asm("# val: %0" : "=x" (val)); /* fix "is used uninitialized" warning */

  asm( "pcmpeqd %[sval], %[dval]\n\t"
       "movdqa %[dval], (%[sbuffer])\n\t"
       "movdqa %[dval], 16(%[sbuffer])" :

       [dval] "=x" (val), [buffer] "=m"  (buffer), [buffer1] "=m" (buffer[1]) :
       [sval] "x" (val), [sbuffer] "r"  (buffer));
}

There are a few things to note here.

  1. I'm using the first asm statement to resolve a compiler warning about using val before it gets assigned. This is caused by using val as an input when it has never had a value assigned to it. Presumably in your real code you assign a reasonable value before using it.
  2. By putting the 3 asm statements into one asm block, gcc can't move individual pieces around.
  3. Why do I have sbuffer, buffer and buffer1, but never reference buffer and buffer1? sbuffer is used to get the pointer to the array into a register. "buffer" and "buffer1" are listed as outputs since I must tell gcc that I am changing these. Using the "memory" clobber is easier, but that can have serious performance implications. Alternately I could use some form of (from the gcc docs re extended asm):

"m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )

This tells gcc that 10 chars starting at ptr will be accessed. Ugly, but it works if you know at compile time how many bytes of memory you're modifying. Point being, if you are changing any values in your asm (even entries in an array), you must let gcc know.

What else? Ahh yes, let's look at the asm (from -Os):

pcmpeqd %xmm0, %xmm0
movdqa %xmm0, (%rax)
movdqa %xmm0, 16(%rax)

As I understand it, the whole reason you were trying to use r13 is to avoid having the register clobbered when you call some subroutines that you don't control, wasting cycles reloading it each loop. So having this code use rax, well, that doesn't seem like a good idea, right? But wait! Watch what happens with this code:

__m128d buffer[100];   

int main( void )
{
  register __m128d val;

  for (int x=0; x < 10; x++)
  {
    asm("# val: %0" : "=x" (val)); /* fix "is used uninitialized" */

    asm( "pcmpeqd %[src], %[dst]\n\t"
         "movdqa %[src], (%[sbuffer])\n\t" /* buffer[0] */
         "movdqa %[src], 16(%[sbuffer])" : /* buffer[1] */

         [dst] "=x" (val), [buffer] "=m"  (buffer), [buffer1] "=m" (buffer[1]) :
         [src] "x" (val), [sbuffer] "r"  (buffer));

     printf("%d\n", val);
   }
}

The asm is the same, but now we are in a loop and calling printf (a routine we don't control). What does the asm look like now? Here's the loop:

.L2:
    leaq    .LC0(%rip), %rcx
    movq    %rdi, %rdx
    pcmpeqd %xmm6, %xmm0
    movdqa %xmm6, (%rbx)
    movdqa %xmm6, 16(%rbx)
    movapd  %xmm0, 32(%rsp)
    call    printf
    subl    $1, %esi
    jne .L2

Well, it has changed from rax to rbx. Is that better? Well, actually it is. When you call subroutines in C, there are some rules the compiler must follow (the ABI). These rules control things like where parameters get passed, where return values are located, who cleans up the stack, and (most importantly for our purposes here) which registers the subroutine must preserve (i.e., which must have the same value when it returns). Wikipedia has some discussion of this, along with useful links. One thing of note is that rbx must be preserved (for x86-64).

As a result, if you look at the asm surrounding this code, you will notice that rbx only gets loaded once (outside the loop). Gcc knows that if any subroutines muck with rbx, they will put it back when they are done. What's more, since subroutines know they have to preserve rbx, they tend to avoid it unless the benefit of having one more register available is greater than the cost of saving/restoring it.

As for the whole idea of "reserving" a register and preventing any subroutine from using it, well, I won't say it's impossible (see Global Reg Vars and -ffixed-reg), but I will say that usually it's a terrible idea. Registers on x86 are a very useful and very limited resource. Trying to limit the number available will almost certainly cause more performance issues than it could ever fix.

There are two important takeaways here:

  1. Trust the compiler. Letting it know you need a pointer to "buffer" in a register is usually sufficient. Gcc is (usually) smart enough to pick the best register for the task.
  2. You MUST MUST MUST tell gcc everything that you are changing in the asm block either by using clobbers or outputs (do not modify inputs). Failing to do this will lead to weird and hard to track down problems.

Ok, a lot of detail here (probably way more than you need). Hopefully the answers you seek are here as well.

Licensed under: CC-BY-SA with attribution