It all depends on your compiler and other implementation details. It is true that memmove
can be implemented in a tricky, highly optimized way. But at the same time a smart compiler might be able to figure out what your per-element copying code is doing and optimize it in the same (or a very similar) way. Try it and see for yourself.
memmove vs. copying individual array elements
02-06-2022
Problem
In CLRS chapter 2 there is an exercise which asks whether the worst-case running time of insertion sort can be improved to O(n lg n)
. I looked into this question and found that it cannot be done.
The worst-case complexity cannot be improved, but would the real running time be better if memmove
were used instead of moving the array elements individually?
Code for individually moving elements
void insertion_sort(int arr[], int length)
{
    /*
     * Sorts into increasing order.
     * For decreasing order, flip the comparison in the inner for-loop.
     */
    for (int j = 1; j < length; j++)
    {
        int temp = arr[j];
        int k;
        for (k = j - 1; k >= 0 && arr[k] > temp; k--) {
            arr[k + 1] = arr[k];
        }
        arr[k + 1] = temp;
    }
}
Code for moving elements by using memmove
void insertion_sort(int arr[], int length)
{
    for (int j = 1; j < length; j++)
    {
        int temp = arr[j];
        int k;
        /* Find the insertion point without moving anything yet */
        for (k = j - 1; k >= 0 && arr[k] > temp; k--) {
            ;
        }
        if (k != j - 1) {
            /* Shift arr[k+1 .. j-1], i.e. j - k - 1 elements, one slot right */
            memmove(&arr[k + 2], &arr[k + 1], sizeof(int) * (j - k - 1));
        }
        arr[k + 1] = temp;
    }
}
I couldn't get the second one to run perfectly, but it is an example of what I am thinking of doing.
Would there be any visible speed improvement from using memmove
?
Solution 3
Other tips
The implementation behind memmove()
may be more heavily optimized in your C library. Some architectures have instructions for moving whole blocks of memory at once very efficiently. The theoretical running-time complexity won't improve, but it may still run faster in practice: memmove
is tuned to make maximum use of the available system resources (uniquely for each implementation, of course).
Here is a quote from Expert C Programming: Deep C Secrets on the difference between using a loop and using memcpy
(it is preceded by two code snippets: one copies a source into a destination with a for
loop, the other with memcpy
):
In this particular case both the source and destination use the same cache line, causing every memory reference to miss the cache and stall the processor while it waited for regular memory to deliver. The library
memcpy()
routine is especially tuned for high performance. It unrolls the loop to read for one cache line and then write, which avoids the problem. Using the smart copy, we were able to get a huge performance improvement. This also shows the folly of drawing conclusions from simple-minded benchmark programs.
This dates from 1994, but it still illustrates how much better optimised the standard library functions are compared to anything you roll yourself: the loop case took around 7 seconds to run versus 1 for the memcpy
.
While memmove
may be slightly slower than memcpy
because of the assumptions it has to make about the source and destination (with memcpy
they must not overlap), it should still be far superior to a naive loop.
Note that this does not affect complexity (as another poster has pointed out). Complexity does not depend on having a bigger cache or an unrolled loop :)
As requested here are the code snippets (slightly changed):
#include <string.h>

#define DUMBCOPY for (i = 0; i < 65536; i++) destination[i] = source[i]
#define SMARTCOPY memcpy(destination, source, 65536)

int main()
{
    char source[65536], destination[65536];
    int i, j;
    for (j = 0; j < 100; j++)
        DUMBCOPY; /* or put SMARTCOPY here instead */
    return 0;
}
On my machine (32 bit, Linux Mint, GCC 4.6.3) I got the following times:
Using SMARTCOPY:
$ time ./a.out
real 0m0.002s
user 0m0.000s
sys 0m0.000s
Using DUMBCOPY:
$ time ./a.out
real 0m0.050s
user 0m0.036s
sys 0m0.000s
You cannot beat memcpy with a plain C implementation, because it is written in assembly with good algorithms.
If you write assembly with a specific CPU in mind and develop a good algorithm that takes the cache into account, you may have a chance.
The standard library functions are so well optimized that it is almost always better to use them.