Question

This is more a question out of curiosity than anything important, but I was just wondering about the following snippet in the memmove documentation:

Copying takes place as if an intermediate buffer were used

(emphasis mine). The formulation suggests to me that whether an intermediate buffer is actually used is specific to the implementation.

If you would ask me to write memmove, I would probably automatically do the following:

  • allocate n bytes on the heap
  • memcpy the source to the temp
  • memcpy the temp to the destination
  • free the buffer
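Expressed as C, that naive approach would look roughly like this (a sketch only; the function name naive_memmove is made up, and the error handling alone hints at a problem, since the real memmove has no failure mode):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical naive memmove: always routes the copy through a
 * heap-allocated temporary buffer. */
void *naive_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *tmp = malloc(n);
    if (tmp == NULL)
        return NULL;          /* the real memmove cannot fail */
    memcpy(tmp, src, n);      /* source -> temporary */
    memcpy(dest, tmp, n);     /* temporary -> destination */
    free(tmp);
    return dest;
}
```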

I was hoping someone could ...

  1. ... confirm whether the formulation is phrased this way only because it's easier for users to visualize what is going on, without requiring that implementations actually use an intermediate buffer;
  2. ... shed some light on the actual implementation in some common toolchains (like GCC or Visual C++) - for example, does it use a buffer, and does it check for overlap so it can memcpy directly;
  3. ... maybe point out the blatantly obvious error / inefficiency in my simple algorithm above.

Solution

  1. Indeed. "As if" means that it must behave as though that's what it did, but it doesn't constrain the implementation to actually do that. The only required behaviour is that the destination buffer ends up with the correct bytes from the source buffer, whether or not the buffers overlap.

  2. A common implementation is to copy bytes forwards from the beginning of the buffer if the destination starts before the source, and backwards from the end otherwise. This ensures that the source bytes are always read before they're overwritten, if there is an overlap.

  3. There are no correctness errors, unless the allocation fails (a case your algorithm doesn't handle, and which the real memmove cannot hit, since it never allocates). The inefficiencies are allocating and freeing the temporary buffer, and copying each byte twice rather than once.

OTHER TIPS

You are absolutely right on #1 - the description is there to help users visualize what's going on logically, not to specify how it is implemented.

However, no sane implementation would actually use an expensive temporary buffer, because all you need to do to avoid double-copying is decide whether to copy from the beginning or from the end.
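A minimal sketch of that direction-choosing strategy might look like this (the name my_memmove is made up; a real implementation copies word-sized chunks rather than single bytes, and compares addresses via integer casts rather than raw pointer comparison):

```c
#include <stddef.h>

/* Sketch: choose the copy direction so that overlapping source bytes
 * are always read before they are overwritten. Byte-at-a-time for
 * clarity only. */
void *my_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    if (d < s) {
        while (n--)
            *d++ = *s++;    /* destination starts first: copy forwards */
    } else if (d > s) {
        d += n;
        s += n;
        while (n--)
            *--d = *--s;    /* destination starts later: copy backwards */
    }
    return dest;
}
```

Copying backwards when the destination starts after the source is exactly what makes the overlapping shift-right case come out correct with no temporary buffer at all.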

The only problem with your algorithm is that it can run your system out of memory when that is entirely unnecessary: imagine a program that tries to move a buffer whose size is 60% of the memory available to it - the temporary copy pushes the total requirement to 120%, so the allocation fails even though moving in place needs no extra memory at all.

One of the first optimization opportunities that comes to mind is to do a plain memcpy() if the buffers do not overlap. Due to the flat nature of the (virtual) address space, it's easy to check. I looked at the glibc and Android implementations, and both do this (the Android one is easier to follow for the uninitiated).
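The overlap test itself is just a pair of comparisons; here is a sketch, assuming the flat address space mentioned above (the helper name regions_overlap is made up, and a libc would cast through uintptr_t precisely because comparing pointers into different objects is undefined in ISO C):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: [d, d+n) and [s, s+n) overlap iff neither
 * region ends at or before the other begins. */
static bool regions_overlap(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    return d < s + n && s < d + n;
}
```

When this returns false, the implementation can simply tail-call memcpy for the fast path.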

Allocating memory on the heap is probably a no-go, because it will be quite slow (dynamic allocation is not that cheap).

If the buffers do overlap, we could copy the non-overlapping parts directly, and for the rest we might use a small scratch buffer; but that buffer would be stack-allocated, if any allocation is needed at all. The Android implementation copies just one byte at a time; we could do better on amd64, but that's the same sort of optimization that would already be done in memcpy. The glibc implementation copies either forwards or backwards depending on the nature of the overlap (that's what "BWD" in the source refers to).

The implementation does not matter. The wording exists to guarantee that overlapping memory will be handled correctly.

char buf[] = { 0x11, 0x22, 0x33, 0x00 };
memcpy(buf + 1, buf, 3);

is undefined behaviour because the source and destination overlap; with a naive forward byte copy it could result in buf being { 0x11, 0x11, 0x11, 0x11 }.

whereas

char buf[] = { 0x11, 0x22, 0x33, 0x00 };
memmove(buf + 1, buf, 3);

is guaranteed to leave buf as { 0x11, 0x11, 0x22, 0x33 }.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow