It all depends on your compiler and other implementation details. It is true that memmove
can be implemented in a tricky, highly optimized way. But at the same time a smart compiler might be able to figure out what your per-element copying code is doing and optimize it in the same (or a very similar) way. Try it and see for yourself.
memmove vs. copying individual array elements
02-06-2022
Problem
In CLRS chapter 2 there is an exercise which asks whether the worst-case running time of insertion sort can be improved to O(n lg n)
. I looked into this question and found that it cannot be done.
The worst-case complexity cannot be improved, but would the real running time be better if memmove
were used instead of moving the array elements individually?
Code for individually moving elements
void insertion_sort(int arr[], int length)
{
    /*
     * Sorts into increasing order.
     * For decreasing order, flip the comparison in the inner for-loop.
     */
    for (int j = 1; j < length; j++)
    {
        int temp = arr[j];
        int k;
        for (k = j - 1; k >= 0 && arr[k] > temp; k--) {
            arr[k + 1] = arr[k];
        }
        arr[k + 1] = temp;
    }
}
Code for moving elements by using memmove
void insertion_sort(int arr[], int length)
{
    for (int j = 1; j < length; j++)
    {
        int temp = arr[j];
        int k;
        /* Find the insertion point without moving anything yet */
        for (k = j - 1; k >= 0 && arr[k] > temp; k--) {
            ;
        }
        if (k != j - 1) {
            /* Shift arr[k+1 .. j-1], i.e. j - k - 1 elements, one slot right */
            memmove(&arr[k + 2], &arr[k + 1], sizeof(int) * (j - k - 1));
        }
        arr[k + 1] = temp;
    }
}
I couldn't get the second one to run perfectly, but it is an example of what I am thinking of doing.
Would there be any visible speed improvement from using memmove
?
Solution 3
Other tips
The implementation behind memmove()
may be more heavily optimized in your C library. Some architectures have instructions for moving whole blocks of memory at once very efficiently. The theoretical running-time complexity won't improve, but it may still run faster in practice: memmove
is tuned to make maximum use of the available system resources (uniquely for each implementation, of course).
Here is a quote from Expert C Programming: Deep C Secrets on the difference between using a loop and using memcpy
(it is preceded by two code snippets: one copies a source into a destination with a for
loop, the other with memcpy
):
In this particular case both the source and destination use the same cache line, causing every memory reference to miss the cache and stall the processor while it waited for regular memory to deliver. The library
memcpy()
routine is especially tuned for high performance. It unrolls the loop to read for one cache line and then write, which avoids the problem. Using the smart copy, we were able to get a huge performance improvement. This also shows the folly of drawing conclusions from simple-minded benchmark programs.
This dates from 1994, but it still illustrates how much better optimised the standard library functions are compared to anything you roll yourself: the loop case took around 7 seconds to run versus 1 for the memcpy
.
While memmove
may be slightly slower than memcpy
because of the assumptions it has to make about the source and destination (with memcpy
they must not overlap), it should still be far superior to a naive loop.
Note that this does not affect complexity (as another poster has pointed out). Complexity does not depend on having a bigger cache or an unrolled loop :)
As requested here are the code snippets (slightly changed):
#include <string.h>

#define DUMBCOPY for (i = 0; i < 65536; i++) destination[i] = source[i]
#define SMARTCOPY memcpy(destination, source, 65536)

int main()
{
    char source[65536], destination[65536];
    int i, j;
    for (j = 0; j < 100; j++)
        DUMBCOPY; /* or put SMARTCOPY here instead */
    return 0;
}
On my machine (32 bit, Linux Mint, GCC 4.6.3) I got the following times:
Using SMARTCOPY:
$ time ./a.out
real 0m0.002s
user 0m0.000s
sys 0m0.000s
Using DUMBCOPY:
$ time ./a.out
real 0m0.050s
user 0m0.036s
sys 0m0.000s
You cannot beat memcpy with a plain C implementation, because it is written in assembly with good algorithms.
If you write assembly with a specific CPU in mind and develop a good algorithm that takes the cache into account, you may have a chance.
The standard library functions are so well optimized that it is almost always better to use them.