Question

I have two arrays and I want to copy one array into the other with some stride. For example, I have

A A A A A A A A ...

B B B B B B B B ...

and I want to copy every three elements of B to A to obtain

B A A B A A B A ...

From the post "Is there a standard, strided version of memcpy?", it seems that there is no such possibility in C.

However, I have observed that, in some cases, memcpy is faster than a for-loop-based copy.

My question is: is there any way to efficiently perform a strided memory copy in C++ that performs at least as well as a standard for loop?

Thank you very much.

EDIT - CLARIFICATION OF THE PROBLEM

To make the problem clearer, let us denote the two arrays at hand by a and b. I have a function that performs only the following for loop

for (int i = 0; i < NumElements; i++)
    a_[i] = b_[i];

where both []'s are overloaded operators (I'm using an expression templates technique), so that they can actually mean, for example,

 a[3*i]=b[i];

Solution 2

Is there any way to efficiently perform a strided memory copy in C++ that performs at least as well as a standard for loop?

Edit 2: There is no dedicated function for strided copying in the C++ standard library.

Since strided copying is not as popular as plain memory copying, neither chip manufacturers nor language designers have added specialized support for it.

Assuming a standard for loop, you may be able to gain some performance by using loop unrolling. Some compilers have options to unroll loops automatically, but it is not a "standard" option.

Given a standard for loop:

#define RESULT_SIZE 72
#define SIZE_A 48
#define SIZE_B 24

unsigned int A[SIZE_A];
unsigned int B[SIZE_B];
unsigned int result[RESULT_SIZE];

unsigned int index_a = 0;
unsigned int index_b = 0;
unsigned int index_result = 0;
for (index_result = 0; index_result < RESULT_SIZE;)
{
   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 
}

Loop unrolling would repeat the contents of the "standard" for loop:

for (index_result = 0; index_result < RESULT_SIZE;)
{
   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 

   result[index_result++] = B[index_b++];
   result[index_result++] = A[index_a++];
   result[index_result++] = A[index_a++]; 
}

In the unrolled version, the number of loop iterations has been cut in half. Note that this relies on RESULT_SIZE being a multiple of the unroll factor (six writes per iteration), which holds here since 72 = 6 * 12.

The performance improvement may be negligible compared to other factors. The following issues all affect performance, and each may yield a different speed improvement:

  • Processing data cache misses
  • Reloading of instruction pipeline (depends on processor)
  • Operating System swapping memory with disk
  • Other tasks running concurrently
  • Parallel processing (depends on processor / platform)

One example of parallel processing is to have one processor copy the B items to the new array and another processor copy the A items to the new array.

OTHER TIPS

This might be too specific an answer, but on an ARM platform that supports NEON, NEON vectorization can make strided copies even faster. This can be life-saving in environments where resources are relatively limited, which is probably why ARM is used in such settings in the first place. A prominent example is Android, where most devices still use the ARMv7-A architecture, which supports NEON.

The following examples demonstrate this with a loop that copies the semi-planar UV plane of a YUV420sp image into the planar UV plane of a YUV420p image. The source and destination buffers are both 640*480/2 bytes. All of the examples were compiled with g++ 4.8 from Android NDK r9d and executed on a Samsung Exynos Octa 5420 processor:

Level 1: Regular

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    for (int i = 0; i < stride; i++) {
        dstptr[i]          = srcptr[i * 2];
        dstptr[i + stride] = srcptr[i * 2 + 1];
    }
}

Compiled with -O3 only, this takes about 1.5 ms on average.

Level 2: Unrolled and squeezed a bit more with moving pointers

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    unsigned char* endptr = dstptr + stride;
    while(dstptr<endptr){
        *(dstptr + 0)             = *(srcptr + 0);
        *(dstptr + stride + 0)    = *(srcptr + 1);
        *(dstptr + 1)             = *(srcptr + 2);
        *(dstptr + stride + 1)    = *(srcptr + 3);
        *(dstptr + 2)             = *(srcptr + 4);
        *(dstptr + stride + 2)    = *(srcptr + 5);
        *(dstptr + 3)             = *(srcptr + 6);
        *(dstptr + stride + 3)    = *(srcptr + 7);
        *(dstptr + 4)             = *(srcptr + 8);
        *(dstptr + stride + 4)    = *(srcptr + 9);
        *(dstptr + 5)             = *(srcptr + 10);
        *(dstptr + stride + 5)    = *(srcptr + 11);
        *(dstptr + 6)             = *(srcptr + 12);
        *(dstptr + stride + 6)    = *(srcptr + 13);
        *(dstptr + 7)             = *(srcptr + 14);
        *(dstptr + stride + 7)    = *(srcptr + 15);
        srcptr+=16;
        dstptr+=8;
    } 
}

Compiled with -O3 only, this takes about 1.15 ms on average. This is probably as fast as it gets on a regular architecture, as per the other answer.

Level 3: Regular + GCC automatic NEON vectorization

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    for (int i = 0; i < stride; i++) {
        dstptr[i]          = srcptr[i * 2];
        dstptr[i + stride] = srcptr[i * 2 + 1];
    }
}

Compiled with -O3 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=1 -mfloat-abi=softfp, this takes about 0.6 ms on average. For reference, a memcpy of 640*480 bytes, i.e. double the amount of data tested here, takes about 0.6 ms on average.

As a side note, the second version (unrolled, with moving pointers) compiled with the NEON parameters above takes about the same amount of time, 0.6 ms.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow