Question

I'm using gcc on x86-64 and declaring some local variables with the "register" modifier. I would like to find a way to severely discourage the compiler from allocating and using stack space for these variables. I'd like these variables to remain in registers as much as possible. I'm mixing C/C++ code with inline assembly.

The variables are simply working storage and don't need to be permanently stored and retrieved later, yet I see my gcc -O2 code still tucking them into their local stack slots from time to time. I understand that their state needs to be preserved whenever I call a C/C++ function, but can I do something to make sure this spilling is severely discouraged?

Here is an example of what I'm doing. This is a portion of an event-driven logic simulator for those who are wondering:

register __m128d VAL0, VAL1, diff0, diff1;
register __m128d *outputValPtr;
__m128d **cmfmLocs;
...
// all pointers are made to point to valid data
// cmfmLocs is a 0-terminated array of pointers with at least one entry

diff0 = outputValPtr[0];
diff1 = outputValPtr[1];
VAL0 = *(cmfmLocs[0]);
VAL1 = *(cmfmLocs[0]+1);

cfPin = 1;
do
{
  asm( "andpd %[src1], %[dest1]\n"
       "orpd  %[src2], %[dest2]\n" :
       [dest1] "=x" (VAL0),
       [dest2] "=x" (VAL1) :
       [src1]  "m" (*(cmfmLocs[cfPin])),
       [src2]  "m" (*(cmfmLocs[cfPin]+1)) );
  cfPin++;
} while ( cmfmLocs[cfPin] );

asm( "xorpd %[val0], %[diffBit0]\n"
     "xorpd %[val1], %[diffBit1]\n"
     "orpd  %[diffBit1], %[diffBit0]\n"
     "ptest %[diffBit0], %[diffBit0]\n"
     "jz dontSchedule\n"
     "movdqa %[val0],   (%[permStor])\n"
     "movdqa %[val1], 16(%[permStor])\n" :
     [diffBit0]  "=x" (diff0),
     [diffBit1]  "=x" (diff1),
     [memWrite1] "=m" (outputValPtr[0]),
     [memWrite2] "=m" (outputValPtr[1]) :
     [val0]      "x"  (VAL0),
     [val1]      "x"  (VAL1),
     [permStor]  "p"  (outputValPtr) );
SCHEDULE_GOTOS;
asm( "dontSchedule:\n" );

This code produced the following assembly with -O2:

2348: 48 8b 4b 50           mov    0x50(%rbx),%rcx
234c: ba 01 00 00 00        mov    $0x1,%edx
2351: 48 8b 41 08           mov    0x8(%rcx),%rax
2355: 0f 1f 00              nopl   (%rax)
2358: 83 c2 01              add    $0x1,%edx
235b: 66 0f 54 00           andpd  (%rax),%xmm0
235f: 66 0f 56 48 10        orpd   0x10(%rax),%xmm1
2364: 0f b7 c2              movzwl %dx,%eax
2367: 66 0f 29 4c 24 20     movapd %xmm1,0x20(%rsp)   # Why is this necessary?
236d: 66 0f 29 44 24 30     movapd %xmm0,0x30(%rsp)   # Why is this necessary?
2373: 48 8b 04 c1           mov    (%rcx,%rax,8),%rax
2377: 48 85 c0              test   %rax,%rax
237a: 75 dc                 jne    2358 <TEST_LABEL+0x10>
237c: 66 0f 57 d0           xorpd  %xmm0,%xmm2
2380: 66 0f 57 d9           xorpd  %xmm1,%xmm3
2384: 66 0f 56 d3           orpd   %xmm3,%xmm2
2388: 66 0f 38 17 d2        ptest  %xmm2,%xmm2
238d: 0f 84 cf e7 ff ff     je     b62 <dontSchedule>
2393: 66 41 0f 7f 07        movdqa %xmm0,(%r15)     # After storing here, xmm0/1 values
2398: 66 41 0f 7f 4f 10     movdqa %xmm1,0x10(%r15) #  are not needed anymore.
... # my C scheduler routine here ...
0000000000000b62 <dontSchedule>:

Solution

I think I've got it now!

I'm using the intrinsics, mainly because they are easy to use and don't require leaving the "C world". The key for me was localizing the scope of my register variables. That probably should have been obvious, but I got bogged down in the details. My actual code now looks like this:

      ...
      case SimF_AND:
      {
        register __m128d VAL0 = *(cmfmLocs[0]);
        register __m128d VAL1 = *(cmfmLocs[0]+1);
        register __m128d diff0 = outputValPtr[0];
        register __m128d diff1 = outputValPtr[1];
        cfPin = 1;
        do
        {
          VAL0 = _mm_and_pd( VAL0, *(cmfmLocs[cfPin]) );
          VAL1 =  _mm_or_pd( VAL1, *(cmfmLocs[cfPin]+1) );
          cfPin++;
        } while ( cmfmLocs[cfPin] );
        diff0 = _mm_or_pd( _mm_xor_pd( VAL0, diff0 ), _mm_xor_pd( VAL1, diff1 ) );
        if ( !_mm_testz_pd( diff0, diff0 ) )
        {
          outputValPtr[0] = VAL0;
          outputValPtr[1] = VAL1;
          outputValPtr[2] = _mm_xor_pd( VAL0, VAL0 );
          SCHEDULE_GOTOS;
        }
      } // register variables go out of scope here
      break;
      ...

So now it is very easy for both me and the compiler to see that these variables are not referenced after outputValPtr is updated. This produces assembly that does not reserve stack space for the locals so they don't generate any memory writes of their own anymore.
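
For readers who want to try the scoping idea in isolation, here is a minimal sketch under some assumptions. The names eval_and_gate and any_bit_set are hypothetical and not part of the original simulator, the scheduling step is reduced to a return value, and the zero test uses SSE2-only intrinsics in place of _mm_testz_pd so that it builds without extra -m flags. Whether the locals stay entirely in registers still depends on the compiler version and the surrounding code, so it is worth checking the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */

/* Hypothetical helper: returns 1 if any bit of v is set (SSE2 only). */
static int any_bit_set(__m128d v)
{
    __m128i eq = _mm_cmpeq_epi8(_mm_castpd_si128(v), _mm_setzero_si128());
    return _mm_movemask_epi8(eq) != 0xFFFF;  /* all 16 bytes equal to 0 gives 0xFFFF */
}

/* Hypothetical stand-in for one case of the simulator's switch.
   locs is a NULL-terminated array with at least one entry; each entry
   points at two consecutive __m128d values. Returns 1 if out[] changed. */
static int eval_and_gate(__m128d **locs, __m128d *out)
{
    int changed;
    assert(locs[0] != NULL);
    {
        /* Working values live only inside this block. */
        __m128d val0 = locs[0][0];
        __m128d val1 = locs[0][1];
        size_t pin;

        for (pin = 1; locs[pin] != NULL; pin++) {
            val0 = _mm_and_pd(val0, locs[pin][0]);
            val1 = _mm_or_pd(val1, locs[pin][1]);
        }

        /* Compare against the stored outputs; write back only on change. */
        changed = any_bit_set(_mm_or_pd(_mm_xor_pd(val0, out[0]),
                                        _mm_xor_pd(val1, out[1])));
        if (changed) {
            out[0] = val0;
            out[1] = val1;
        }
    } /* val0 and val1 are dead here, before any later function call */
    return changed;
}
```

Compiling with gcc -O2 -S and searching the loop for stores relative to %rsp is a quick way to confirm whether anything is being spilled.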

Thanks to all those who left responses. You definitely led me down the right path!

Other suggestions

I was told long ago that there are three classes of C compilers: really dumb ones, which ignore the register keyword; dumb ones, which heed the keyword and reserve a few registers for it; and smart ones, which do a much better job of shuffling values around than just keeping a value in a fixed register.

If you use GCC's inline assembly, where the values reside should be (almost) transparent. You can force operands into specific registers by using operand constraints, and the compiler will make sure they are respected.
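
As a rough illustration of how constraints steer register placement, here is a small sketch with two hypothetical helpers (and_accumulate and shift_left are made-up names for this example). The first uses "+x" to say that the operand lives in an SSE register and is both read and written, which is what a read-modify-write instruction such as andpd needs; an output-only "=x" would tell the compiler that the incoming value is dead. The second uses "c" to pin the shift count to %cl, as the variable-count form of shl requires.

```c
#include <stdint.h>
#include <emmintrin.h>  /* __m128d and SSE2 intrinsics */

/* Hypothetical helper: acc &= *src with andpd.
   "+x": acc sits in an XMM register and is read and written.
   "m" : *src is used directly as a (16-byte aligned) memory operand. */
static __m128d and_accumulate(__m128d acc, const __m128d *src)
{
    __asm__("andpd %[src], %[acc]"
            : [acc] "+x"(acc)
            : [src] "m"(*src));
    return acc;
}

/* Hypothetical helper: v <<= n.
   "c" forces n into the C register, so the literal %cl in the template is valid.
   "cc" tells the compiler that the flags are clobbered. */
static uint64_t shift_left(uint64_t v, unsigned char n)
{
    __asm__("shlq %%cl, %[v]"
            : [v] "+r"(v)
            : "c"(n)
            : "cc");
    return v;
}
```

In both helpers the compiler is still free to choose which XMM or general-purpose register holds the "+x" and "+r" operands; only the "c" operand is tied to a specific register.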

Besides, "just working storage" isn't a good enough reason to tie up a valuable register, even on x86_64, which isn't register-starved. Writing parts of the program in assembly has a terrible cost in programmer time and portability, so you had better make very sure that it matters for performance (or that the code can't be written portably in the first place).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow