Question
I have seen related questions, including here and here, but it seems that the only instruction ever mentioned for serializing `rdtsc` is `cpuid`.
Unfortunately, `cpuid` takes roughly 1000 cycles on my system, so I am wondering if anyone knows of a cheaper (fewer cycles and no read or write to memory) serializing instruction?
I looked at `iret`, but that seems to change control flow, which is also undesirable.
I have actually looked at the whitepaper linked in Alex's answer about `rdtscp`, but it says:
The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed.
That second point seems to make it less than ideal.
Solution
Have you looked at the `rdtscp` instruction? This is the read-serialized version of `rdtsc`.
For benchmarking I would recommend reading this whitepaper. It provides a couple of best practices for measuring clock ticks.
Alex (Intel)
Other tips
For ordering `rdtsc` wrt. other instructions, `lfence` is sufficient if you don't need to wait for the store buffer to drain. That has always been true on Intel, and on AMD since the Spectre mitigations were enabled. See solution to rdtsc out of order execution?
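As a rough illustration of the `lfence` approach, here is a minimal sketch using the GCC/Clang intrinsics from `<x86intrin.h>`. The helper name `tsc_fenced` is made up for this example, and the trailing `lfence` is an extra assumption to keep later instructions from starting before the read.

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang: __rdtsc, _mm_lfence */

/* Hypothetical helper: read the TSC with lfence on both sides.
 * The first lfence waits for earlier instructions to finish executing;
 * the second keeps later instructions from starting before the read.
 * Neither one waits for the store buffer to drain. */
static inline uint64_t tsc_fenced(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
```

Calling `tsc_fenced()` before and after a region gives an elapsed count in TSC ticks, which are reference ticks rather than core clock cycles.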
`rdtscp` is also guaranteed to be ordered wrt. earlier instructions (but not later; in practice it's probably microcoded pretty much like `lfence; rdtsc` in that order, plus uops to write ECX with the processor ID). It's not an x86 serializing instruction, and doesn't even drain the store buffer. (Which you wouldn't necessarily want for timing anyway.) You can use `mfence; rdtscp` or `lock or byte [rsp], 0 ; rdtscp` if you want that, or `rdtscp; lfence` if you want to make sure its few uops can't reorder with later stuff.
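The two `rdtscp` variants mentioned above might look like this with GCC/Clang intrinsics. This is only a sketch: the function names are invented for illustration, and it assumes a CPU that supports `rdtscp`.

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang: __rdtscp, _mm_lfence, _mm_mfence */

/* Hypothetical helper for "rdtscp; lfence": rdtscp waits for earlier
 * instructions, and the lfence stops later ones from starting early.
 * *cpu receives the value rdtscp writes to ECX (the processor ID). */
static inline uint64_t rdtscp_then_lfence(unsigned *cpu)
{
    uint64_t t = __rdtscp(cpu);
    _mm_lfence();
    return t;
}

/* Hypothetical helper for "mfence; rdtscp": additionally drains the
 * store buffer before the timestamp is taken. */
static inline uint64_t mfence_then_rdtscp(unsigned *cpu)
{
    _mm_mfence();
    return __rdtscp(cpu);
}
```

The `mfence` variant only makes sense when the timed region's stores should be globally visible before the end timestamp; for plain timing the first helper is usually the cheaper choice.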
See also this Q&A for more about the TSC in general, including that it ticks at a fixed frequency rather than counting CPU core cycles.
To answer the title question about "serializing instructions" in the x86 technical terminology sense, Alder Lake (and Sapphire Rapids) and later have `serialize`, which does exactly that and no more.
`lfence` serializes instruction execution (it drains the ROB but not the store buffer).
In a VM, `cpuid` is a guaranteed vmexit so it's slow. It could possibly be faster to push RSP, RFLAGS, CS, and RIP, and run an `iret` instruction. I didn't double-check what `iret` pops so that might not be exactly right.
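To check how expensive `cpuid` actually is on a given machine or VM, something along these lines could be used. It is a sketch in GNU C (inline asm plus intrinsics); `run_cpuid` and `cpuid_cost_ticks` are names invented here, and the result is in TSC ticks, not core cycles.

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang: __rdtsc, __rdtscp, _mm_lfence */

/* Hypothetical helper: execute cpuid (leaf 0) purely for its
 * serializing side effect. volatile + "memory" keep the compiler
 * from dropping or moving it. */
static inline void run_cpuid(void)
{
    unsigned eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
}

/* Hypothetical helper: smallest observed TSC delta around one cpuid,
 * taken over `iterations` runs to filter out interrupts and the like.
 * Includes the (small) overhead of the fences and timestamp reads. */
static uint64_t cpuid_cost_ticks(int iterations)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iterations; i++) {
        unsigned aux;
        _mm_lfence();
        uint64_t start = __rdtsc();
        _mm_lfence();
        run_cpuid();
        uint64_t end = __rdtscp(&aux);
        _mm_lfence();
        if (end - start < best)
            best = end - start;
    }
    return best;
}
```

Comparing the result on bare metal and inside a guest should make the vmexit cost visible.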
Cross-modifying code is a case where a proper serializing instruction can matter vs. something like `mfence; lfence`. After an acquire load sees a release store indicating that the new code is there, you need to run a serializing instruction. Intel's Volume 3 manual, section 8.1.3, guarantees that's sufficient for cross-modifying code to be safe.
I assume that makes sure old code hasn't already been fetched by the front-end. So a serializing instruction might fully nuke the pipeline, or do the equivalent if there's enough tracking of recent instructions in the pipeline to snoop on L1i invalidations. (That extra snooping might not be worth the power since serializing instructions are hopefully rare. The tracking is needed anyway to handle self-modifying code, snooping store addresses for being near any instruction in flight.)
`mfence` (or a `lock or byte [rsp],0`) + `lfence` wouldn't necessarily be strong enough since `lfence` only drains the ROB, concerned with instruction execution not fetch, and `mfence` deals with data load/store. `cpuid` is a good bet for this case if you can't use `serialize`.
(Even an atomic RMW or atomic store within an aligned 8-byte chunk in the writer isn't sufficient. On some microarchitectures, I think unaligned code-fetch of 16-byte chunks from L1i cache is possible, so the reader might tear at any boundary.)
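The reader side of the cross-modifying-code protocol described above (acquire load, then a serializing instruction, then run the new code) could be sketched as follows. The flag `code_ready` and both function names are hypothetical, and `cpuid` stands in for `serialize` on CPUs that lack it.

```c
#include <stdatomic.h>
#include <x86intrin.h>  /* GCC/Clang: _mm_pause */

/* Hypothetical helper: cpuid used only as a serializing instruction. */
static inline void serialize_with_cpuid(void)
{
    unsigned eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
}

/* Hypothetical reader side: spin until the writer's release store to
 * *code_ready is observed, then serialize before the caller jumps
 * into the newly written code. */
static inline void wait_for_new_code(atomic_int *code_ready)
{
    while (!atomic_load_explicit(code_ready, memory_order_acquire))
        _mm_pause();
    serialize_with_cpuid();
}
```

The writer is assumed to finish storing the new instruction bytes and only then set `code_ready` with a release store; the sketch does not cover the writer or how the code page is made writable and executable.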
The answer is apparently not. The Intel Manual, Volume 3a lists only 3 non-privileged serializing instructions (`cpuid`, `iret`, and `rsm`), and the latter two seem to have control-flow side-effects.
Well, I guess this is helpful: `lfence`. See the Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 2B, page 4-301.