I think I've got it now!
I'm using the intrinsics, mainly because they are easy to use and don't require leaving the "C world". The key for me was localizing the scope of my register variables. That probably should have been obvious, but I got bogged down in the details. My actual code now looks like this:
...
case SimF_AND:
{
    register __m128d VAL0 = *(cmfmLocs[0]);
    register __m128d VAL1 = *(cmfmLocs[0]+1);
    register __m128d diff0 = outputValPtr[0];
    register __m128d diff1 = outputValPtr[1];
    cfPin = 1;
    do
    {
        VAL0 = _mm_and_pd( VAL0, *(cmfmLocs[cfPin]) );
        VAL1 = _mm_or_pd( VAL1, *(cmfmLocs[cfPin]+1) );
        cfPin++;
    } while ( cmfmLocs[cfPin] );
    diff0 = _mm_or_pd( _mm_xor_pd( VAL0, diff0 ), _mm_xor_pd( VAL1, diff1 ) );
    if ( !_mm_testz_pd( diff0, diff0 ) )
    {
        outputValPtr[0] = VAL0;
        outputValPtr[1] = VAL1;
        outputValPtr[2] = _mm_xor_pd( VAL0, VAL0 );
        SCHEDULE_GOTOS;
    }
} // register variables go out of scope here
break;
...
So now it is very easy for both me and the compiler to see that these variables are never referenced after outputValPtr is updated. The resulting assembly no longer reserves stack space for the locals, so they no longer generate any memory writes of their own.
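For anyone who wants to try the scoping idea in isolation, here is a minimal, self-contained sketch. It is not my real code: `eval_and`, `any_bit_set`, `inputs` and `out` are made-up names for illustration, the input list is assumed to be NULL-terminated with at least one entry, and I swapped `_mm_testz_pd` for an SSE2-only "is any bit set" check so it should build without extra instruction-set flags. I also left out `register`, since the point here is only the block scope.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_and_pd, _mm_or_pd, _mm_xor_pd */
#include <stddef.h>

/* Hypothetical helper: nonzero if any bit of v is set (SSE2 only). */
static int any_bit_set(__m128d v)
{
    /* Compare each 32-bit lane with zero; all lanes equal gives mask 0xFFFF. */
    __m128i eq = _mm_cmpeq_epi32(_mm_castpd_si128(v), _mm_setzero_si128());
    return _mm_movemask_epi8(eq) != 0xFFFF;
}

/* Hypothetical kernel: ANDs the first vector and ORs the second vector of
 * every input in a NULL-terminated list, then stores to out[] only when the
 * result differs from what is already there.
 * Returns 1 if out[] was updated, 0 otherwise. */
static int eval_and(const __m128d *const *inputs, __m128d *out)
{
    int changed;

    assert(inputs[0] != NULL);  /* assumption: at least one input */

    {   /* vector locals live only inside this block */
        __m128d val0 = inputs[0][0];
        __m128d val1 = inputs[0][1];
        size_t pin;

        for (pin = 1; inputs[pin] != NULL; pin++) {
            val0 = _mm_and_pd(val0, inputs[pin][0]);
            val1 = _mm_or_pd(val1, inputs[pin][1]);
        }

        /* Nonzero bits in diff mean the new value differs from the old one. */
        __m128d diff = _mm_or_pd(_mm_xor_pd(val0, out[0]),
                                 _mm_xor_pd(val1, out[1]));
        changed = any_bit_set(diff);
        if (changed) {
            out[0] = val0;
            out[1] = val1;
        }
    }   /* val0, val1 and diff go out of scope here */

    return changed;
}
```

Calling `eval_and` twice with the same inputs should report a change the first time (if `out` started out different) and no change the second time. Whether the locals actually stay out of memory depends on your compiler and optimization level, so check the generated assembly rather than taking my word for it.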
Thanks to all those who left responses. You definitely led me down the right path!