Is there any way to efficiently perform a strided memory copy in C++ that performs at least as well as a standard for loop?
Edit 2: There is no dedicated function for strided copying in the C++ standard library.
Since strided copying is not as common as plain memory copying, neither chip manufacturers nor language designers have specialized support for it.
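For reference, the baseline being discussed can be written as a small helper. This is only a sketch: the name strided_copy and its parameters are hypothetical (nothing like it exists in the standard library), and strides are assumed to be counted in elements, not bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a standard library function.
// Copies `count` elements, reading every `src_stride`-th element of `src`
// and writing every `dst_stride`-th element of `dst`.
// Strides are in elements (an assumption of this sketch), and the caller
// must make sure both arrays are large enough.
template <typename T>
void strided_copy(const T* src, std::size_t src_stride,
                  T* dst, std::size_t dst_stride,
                  std::size_t count)
{
    assert(src_stride > 0 && dst_stride > 0);
    for (std::size_t i = 0; i < count; ++i)
    {
        dst[i * dst_stride] = src[i * src_stride];
    }
}
```

For example, strided_copy(src, 2, dst, 1, 3) would gather src[0], src[2] and src[4] into dst[0], dst[1] and dst[2]. An optimizing compiler may or may not vectorize this loop, so it is worth measuring on the target platform.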
Assuming a standard for loop, you may be able to gain some performance by using loop unrolling. Some compilers have options to unroll loops, but these are compiler-specific rather than "standard" options.
Given a standard for loop:
#define RESULT_SIZE 72
#define SIZE_A 48
#define SIZE_B 24
unsigned int A[SIZE_A];
unsigned int B[SIZE_B];
unsigned int result[RESULT_SIZE];
unsigned int index_a = 0;
unsigned int index_b = 0;
unsigned int index_result = 0;
for (index_result = 0; index_result < RESULT_SIZE;)
{
    result[index_result++] = B[index_b++];
    result[index_result++] = A[index_a++];
    result[index_result++] = A[index_a++];
}
Loop unrolling would repeat the contents of the "standard" for loop:
for (index_result = 0; index_result < RESULT_SIZE;)
{
    result[index_result++] = B[index_b++];
    result[index_result++] = A[index_a++];
    result[index_result++] = A[index_a++];
    result[index_result++] = B[index_b++];
    result[index_result++] = A[index_a++];
    result[index_result++] = A[index_a++];
}
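In the unrolled version, the number of loop iterations has been cut in half, so the loop condition and branch are evaluated half as often. This works without a cleanup step only because RESULT_SIZE (72) is a multiple of the 6 items written per iteration.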
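To check that unrolling does not change the output, both loops can be wrapped in functions and compared. This is a sketch under a few assumptions: the function names are made up for illustration, the macros are replaced by constexpr constants, and the input values are arbitrary test data.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t RESULT_SIZE = 72;
constexpr std::size_t SIZE_A = 48;
constexpr std::size_t SIZE_B = 24;

// The unrolled loop writes 6 items per iteration and has no cleanup step,
// so it relies on the sizes lining up exactly.
static_assert(RESULT_SIZE % 6 == 0, "unrolled loop assumes a multiple of 6");
static_assert(SIZE_A + SIZE_B == RESULT_SIZE, "inputs must fill the result");

// Baseline: one B item followed by two A items per iteration.
void interleave_standard(const unsigned int* A, const unsigned int* B,
                         unsigned int* result)
{
    std::size_t index_a = 0;
    std::size_t index_b = 0;
    for (std::size_t index_result = 0; index_result < RESULT_SIZE;)
    {
        result[index_result++] = B[index_b++];
        result[index_result++] = A[index_a++];
        result[index_result++] = A[index_a++];
    }
}

// Unrolled by a factor of 2: the loop body appears twice per iteration.
void interleave_unrolled(const unsigned int* A, const unsigned int* B,
                         unsigned int* result)
{
    std::size_t index_a = 0;
    std::size_t index_b = 0;
    for (std::size_t index_result = 0; index_result < RESULT_SIZE;)
    {
        result[index_result++] = B[index_b++];
        result[index_result++] = A[index_a++];
        result[index_result++] = A[index_a++];
        result[index_result++] = B[index_b++];
        result[index_result++] = A[index_a++];
        result[index_result++] = A[index_a++];
    }
}

// Fills the inputs with arbitrary distinct values, runs both versions,
// and reports whether the outputs are identical.
bool unrolled_matches_standard()
{
    unsigned int A[SIZE_A];
    unsigned int B[SIZE_B];
    unsigned int expected[RESULT_SIZE];
    unsigned int actual[RESULT_SIZE];
    for (std::size_t i = 0; i < SIZE_A; ++i) A[i] = 1000u + static_cast<unsigned int>(i);
    for (std::size_t i = 0; i < SIZE_B; ++i) B[i] = 2000u + static_cast<unsigned int>(i);
    interleave_standard(A, B, expected);
    interleave_unrolled(A, B, actual);
    return std::equal(expected, expected + RESULT_SIZE, actual);
}
```

Whether the unrolled version is actually faster depends on the compiler and optimization level; many compilers already unroll such loops on their own, so timing both variants is the only reliable check.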
The performance improvement may be negligible compared to other options. The following factors affect performance, and addressing each may yield a different speed improvement:
- Processing data cache misses
- Reloading of instruction pipeline (depends on processor)
- Operating System swapping memory with disk
- Other tasks running concurrently
- Parallel processing (depends on processor / platform)
One example of parallel processing is to have one processor copy the B items to the new array and another processor copy the A items to the new array.
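The split described above could look roughly like the following, using std::thread. This is a sketch, not a tuned implementation: the function names are hypothetical, the sizes are taken from the example earlier in this answer, and on some toolchains the program must be built with thread support enabled (for example -pthread).

```cpp
#include <cstddef>
#include <thread>

constexpr std::size_t RESULT_SIZE = 72;
constexpr std::size_t SIZE_A = 48;
constexpr std::size_t SIZE_B = 24;

// Copies each B item to every third slot of result (0, 3, 6, ...).
void copy_b_items(const unsigned int* B, unsigned int* result)
{
    for (std::size_t i = 0; i < SIZE_B; ++i)
    {
        result[3 * i] = B[i];
    }
}

// Copies the A items to the two slots after each B item (1, 2, 4, 5, ...).
void copy_a_items(const unsigned int* A, unsigned int* result)
{
    for (std::size_t i = 0; i < SIZE_A; ++i)
    {
        result[3 * (i / 2) + 1 + (i % 2)] = A[i];
    }
}

// One thread copies the B items while the calling thread copies the A items.
// The two loops write to disjoint elements, so no locking is needed.
void interleave_parallel(const unsigned int* A, const unsigned int* B,
                         unsigned int* result)
{
    std::thread b_thread(copy_b_items, B, result);
    copy_a_items(A, result);
    b_thread.join();
}
```

For arrays as small as the ones in this example, the cost of starting a thread will very likely exceed the cost of the copy itself, and both threads write to neighbouring elements in the same cache lines. Splitting the work this way is only worth trying on much larger data, and only after measuring.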