Difference between rdtscp, rdtsc : memory and cpuid / rdtsc?

https://stackoverflow.com/questions/12631856

04-07-2021
|

Question

Assume we're trying to use the tsc for performance monitoring and we we want to prevent instruction reordering.

These are our options:

1: rdtscp is a serializing call. It prevents reordering around the call to rdtscp.

__asm__ __volatile__("rdtscp; "         // serializing read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc variable
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered

However, rdtscp is only available on newer CPUs. So in this case we have to use rdtsc. But rdtsc is non-serializing, so using it alone will not prevent the CPU from reordering it.

So we can use either of these two options to prevent reordering:

2: This is a call to cpuid and then rdtsc. cpuid is a serializing call.

volatile int dont_remove __attribute__((unused)); // volatile to stop optimizing
unsigned tmp;
__cpuid(0, tmp, tmp, tmp, tmp);                   // cpuid is a serialising call
dont_remove = tmp;                                // prevent optimizing out cpuid

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered

3: This is a call to rdtsc with memory in the clobber list, which prevents reordering

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx", "memory"); // rcx and rdx are clobbered
                                                  // memory to prevent reordering

My understanding for the 3rd option is as follows:

Making the call __volatile__ prevents the optimizer from removing the asm or moving it across any instructions that could need the results (or change the inputs) of the asm. However it could still move it with respect to unrelated operations. So __volatile__ is not enough.

Tell the compiler memory is being clobbered: : "memory"). The "memory" clobber means that GCC cannot make any assumptions about memory contents remaining the same across the asm, and thus will not reorder around it.

So my questions are:

1: Is my understanding of __volatile__ and "memory" correct?
2: Do the second two calls do the same thing?
3: Using "memory" looks much simpler than using another serializing instruction. Why would anyone use the 3rd option over the 2nd option?

Solution

As mentioned in a comment, there's a difference between a compiler barrier and a processor barrier. volatile and memory in the asm statement act as a compiler barrier, but the processor is still free to reorder instructions.

Processor barrier are special instructions that must be explicitly given, e.g. rdtscp, cpuid, memory fence instructions (mfence, lfence, ...) etc.

As an aside, while using cpuid as a barrier before rdtsc is common, it can also be very bad from a performance perspective, since virtual machine platforms often trap and emulate the cpuid instruction in order to impose a common set of CPU features across multiple machines in a cluster (to ensure that live migration works). Thus it's better to use one of the memory fence instructions.

The Linux kernel uses mfence;rdtsc on AMD platforms and lfence;rdtsc on Intel. If you don't want to bother with distinguishing between these, mfence;rdtsc works on both although it's slightly slower as mfence is a stronger barrier than lfence.

Edit 2019-11-25: As of Linux kernel 5.4, lfence is used to serialize rdtsc on both Intel and AMD. See this commit "x86: Remove X86_FEATURE_MFENCE_RDTSC": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be261ffce6f13229dad50f59c5e491f933d3167f

OTHER TIPS

you can use it like shown below:

asm volatile (
"CPUID\n\t"/*serialize*/
"RDTSC\n\t"/*read the clock*/
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r"
(cycles_low):: "%rax", "%rbx", "%rcx", "%rdx");
/*
Call the function to benchmark
*/
asm volatile (
"RDTSCP\n\t"/*read the clock*/
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t"
"CPUID\n\t": "=r" (cycles_high1), "=r"
(cycles_low1):: "%rax", "%rbx", "%rcx", "%rdx");

In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. With this method we avoid to call a CPUID instruction in between the reads of the real-time registers

The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed. The two “mov” instructions coming afterwards store the edx and eax registers values into memory. Finally a CPUID call guarantees that a barrier is implemented again so that it is impossible that any instruction coming afterwards is executed before CPUID itself.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow