Using the RDTSC instruction inside a virtual machine is complicated. It is likely that the hypervisor (Xen) is emulating RDTSC by trapping it. Your fastest runs show around 800 cycles per cache line, which is very, very slow. The only explanation I can see is that each RDTSC results in a trap that is handled by the hypervisor, and that overhead is the performance bottleneck. I'm not sure about the even longer times that you see periodically, but given that RDTSC is being trapped, all timing bets are off.
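One rough way to check whether RDTSC is expensive in your guest is to time two back-to-back reads and keep the smallest difference. The following is only a minimal sketch: it assumes GCC or Clang on x86/x86-64 (for the __rdtsc() intrinsic from x86intrin.h), and min_rdtsc_delta is a hypothetical helper name, not something from your code or from Xen.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(); assumes GCC/Clang on x86 or x86-64 */

/* Hypothetical helper: smallest TSC difference observed between two
 * back-to-back RDTSC reads over `samples` attempts. Taking the minimum
 * filters out interrupts and preemption, so what remains is roughly
 * the cost of a single RDTSC as seen by this guest. */
static uint64_t min_rdtsc_delta(int samples)
{
    assert(samples > 0);

    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t t0 = __rdtsc();
        uint64_t t1 = __rdtsc();
        uint64_t delta = t1 - t0;  /* unsigned, so a backwards step shows up as a huge value */
        if (delta < best)
            best = delta;
    }
    return best;
}
```

Call it with a large sample count, for example min_rdtsc_delta(100000), and print the result. On bare metal the minimum is usually a few tens of cycles. A minimum in the hundreds or thousands would be consistent with RDTSC being trapped and emulated, although it does not prove it. Keep in mind that the result is expressed in whatever TSC units the guest is shown.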
You can read more about how Xen handles the TSC here:
http://xenbits.xen.org/docs/4.2-testing/misc/tscmode.txt
Quoting from that document:
Instructions in the rdtsc family are non-privileged, but privileged software may set a cpuid bit to cause all rdtsc family instructions to trap. This trap can be detected by Xen, which can then transparently "emulate" the results of the rdtsc instruction and return control to the code following the rdtsc instruction.
By the way, that article is wrong on one point: the hypervisor doesn't set a cpuid bit to cause RDTSC to trap. The trap is controlled by bit #2 in Control Register 4, the Time Stamp Disable flag (CR4.TSD):