If the question is whether you can actually encounter an efficient memcpy implementation that misbehaves badly enough that you can't trust it on overlapping ranges, then the answer is yes. :-)
Consider one possible implementation of memcpy on the Power(PC) architecture: the lmw instruction loads multiple consecutive words from memory into consecutive registers (you specify the first register, and the range runs up to r31), and stmw then stores that register range back to memory. So the CPU can buffer on the order of 100 bytes during a single memcpy iteration (up to 32 registers of 4 bytes each; lmw and stmw move 32-bit words even on a 64-bit CPU). That is plenty of data to spoil the target range if it overlaps the source, especially since the CPU makes no promises about the relative order of the individual loads and stores.
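To make the effect concrete, here is a minimal sketch in portable C rather than real Power assembly. It only imitates the load-a-block-then-store-a-block pattern: `chunked_copy`, `overlap_corrupts` and the 8-byte `CHUNK` are hypothetical names and values picked for the demo, not anything taken from an actual libc.

```c
#include <stddef.h>
#include <string.h>

/* Assumed block size for the demo; a real lmw/stmw pair could move more. */
enum { CHUNK = 8 };

/* Forward copy that loads a whole block before storing it.
   The local array stands in for the register range. */
static void chunked_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    unsigned char regs[CHUNK];
    while (n > 0) {
        size_t k = n < CHUNK ? n : CHUNK;
        for (size_t i = 0; i < k; i++)   /* "lmw": load the block */
            regs[i] = src[i];
        for (size_t i = 0; i < k; i++)   /* "stmw": store the block */
            dst[i] = regs[i];
        src += k;
        dst += k;
        n -= k;
    }
}

/* Returns 1 if the chunked copy and memmove disagree when the
   destination starts 4 bytes into the source range. */
static int overlap_corrupts(void)
{
    unsigned char a[32], b[32];
    for (int i = 0; i < 32; i++)
        a[i] = b[i] = (unsigned char)i;
    chunked_copy(a + 4, a, 24);  /* overlapping: first store clobbers unread source */
    memmove(b + 4, b, 24);       /* memmove is specified to handle overlap */
    return memcmp(a, b, 32) != 0;
}
```

In this sketch the first block store overwrites source bytes 8..11 before the second block load reads them, so the result differs from memmove. A real implementation may fail differently (or not at all) depending on block size, copy direction and alignment, which is exactly why the overlapping case can't be relied on.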