The original fails because the "A" constraint means rax/eax/ax/al and/or rdx/edx/dx/dl. On x64 only rdx is allocated for the result, so the mov into eax overwrites the address held in rax.
You can get the result in two halves:
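uint32_t lo, hi;
/* Preload edx:eax with ecx:ebx: if the compare succeeds, the value already
   in memory is stored back; either way edx:eax ends up with the current value. */
asm volatile(
"mov %%ebx, %%eax\n"
"mov %%ecx, %%edx\n"
"lock cmpxchg8b %2\n"
: "=&a" (lo), "=&d" (hi), "+m" (v->counter64) /* cmpxchg8b writes to its memory operand */
:
: "cc" /* cmpxchg8b modifies ZF */
);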
ret = lo | ((uint64_t)hi << 32);
However, would an ordinary read suffice?
ret = *(volatile uint64_t *)&v->counter64;
Or is the memory ordering insufficient?
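If the concern is ordering rather than atomicity, one option is to let the compiler emit the load. The following is only a sketch, assuming GCC or Clang (for the `__atomic_load_n` builtin); `counter_t` and `counter_read` are hypothetical names standing in for whatever type `v` points to. On 32-bit targets the compiler may need libatomic for a 64-bit load.

```c
#include <stdint.h>

/* Hypothetical stand-in for the type behind `v` in the snippets above. */
typedef struct {
    uint64_t counter64;
} counter_t;

/* Atomic 64-bit load using the GCC/Clang __atomic_load_n builtin.
 * __ATOMIC_SEQ_CST is the strongest ordering; __ATOMIC_ACQUIRE or
 * __ATOMIC_RELAXED may be enough, depending on what the caller needs. */
static inline uint64_t counter_read(counter_t *v)
{
    return __atomic_load_n(&v->counter64, __ATOMIC_SEQ_CST);
}
```

With this, `ret = counter_read(v);` replaces both the asm block and the volatile cast, and the ordering requirement is stated explicitly in the memory-order argument.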