Fast Converting RGBA to ARGB

https://stackoverflow.com/questions/11259391

18-06-2021

Question

I am trying to convert an RGBA buffer into ARGB. Is there any way to improve the following algorithm, or any other faster way to perform such an operation? Note that the alpha value is not important once it is in the ARGB buffer, and should always end up as 0xFF.

int y, x;
unsigned int pixel;  // unsigned, so that pixel << 16 cannot overflow a signed int

for (y = 0; y < height; y++)
{
    for (x = 0; x < width; x++)
    {
        pixel = rgbaBuffer[y * width + x];
        argbBuffer[(height - y - 1) * width + x] = (pixel & 0xff00ff00) | ((pixel << 16) & 0x00ff0000) | ((pixel >> 16) & 0xff);
    }
}

Solution

I will focus only on the swap function:

#include <stdint.h>

typedef uint32_t Color32;  // guaranteed to be exactly 32 bits

static inline Color32 Color32Reverse(Color32 x)
{
    // Source is in format:       0xAARRGGBB
    return
        ((x & 0xFF000000u) >> 24) | // ______AA
        ((x & 0x00FF0000u) >>  8) | // ____RR__
        ((x & 0x0000FF00u) <<  8) | // __GG____
        ((x & 0x000000FFu) << 24);  // BB______
    // Return value is in format: 0xBBGGRRAA
}
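Color32Reverse is a full four-byte reversal, which GCC and Clang also expose directly as __builtin_bswap32 (MSVC has _byteswap_ulong in <stdlib.h>). Below is a minimal sketch of applying it to a whole buffer; reverse_buffer is a hypothetical name, and the builtin assumes GCC or Clang. Keep in mind that reversing all four bytes turns the memory order R,G,B,A into A,B,G,R, so on its own this gives ABGR rather than ARGB.

```c
#include <stddef.h>
#include <stdint.h>

/* Apply a full 4-byte reversal (what Color32Reverse computes,
   0xAARRGGBB -> 0xBBGGRRAA) to every pixel, in place.
   __builtin_bswap32 is a GCC/Clang builtin (assumed compiler). */
static void reverse_buffer(uint32_t *pixels, size_t count)
{
    for (size_t i = 0; i < count; i++)
        pixels[i] = __builtin_bswap32(pixels[i]);
}
```

For example, a buffer holding 0x11223344 and 0xAABBCCDD becomes 0x44332211 and 0xDDCCBBAA.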

OTHER TIPS

Assuming that the code is not buggy (just inefficient), I guess that all you want to do is swap every second (even-numbered) byte, i.e. bytes 0 and 2 of each pixel (and, of course, flip the buffer vertically), right?

So you can achieve some optimizations by:

Avoiding the shift and masking operations
Optimizing the loop, e.g. saving on index calculations

I would rewrite the code as follows:

int y, x;

for (y = 0; y < height; y++)
{
    unsigned char *pRGBA= (unsigned char *)(rgbaBuffer+y*width);
    unsigned char *pARGB= (unsigned char *)(argbBuffer+(height-y-1)*width);
    for (x = 4*(width-1); x>=0; x-=4)
    {
        pARGB[x  ]   = pRGBA[x+2];
        pARGB[x+1]   = pRGBA[x+1];
        pARGB[x+2]   = pRGBA[x  ];
        pARGB[x+3]   = 0xFF;
    }
}

Please note that the more complex index calculation is performed only in the outer loop. There are four accesses to both rgbaBuffer and argbBuffer for each pixel, but I think this is more than offset by avoiding the bitwise operations and the index calculations. An alternative would be (as in your code) to fetch/store one pixel (int) at a time and do the processing locally (this saves memory accesses), but unless you have some efficient way to swap the two bytes and set the alpha locally (e.g. some inline assembly, so that you make sure everything is performed at register level), it won't really help.
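The one-pixel-at-a-time alternative mentioned in the previous paragraph can also be written in plain C, without inline assembly: one 32-bit load and one store per pixel, with the byte shuffling done on a local variable. This is a sketch under stated assumptions: pixels are stored as uint32_t, the byte-position comments assume a little-endian machine, and convert_flip_rows is a hypothetical name. Whether it beats the byte-wise loop depends on the compiler and CPU, so both are worth measuring.

```c
#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time variant: swaps bytes 0 and 2 of each pixel, forces
   byte 3 (the top byte) to 0xFF, and flips the rows vertically.
   On a little-endian CPU this matches the byte-wise loop above. */
static void convert_flip_rows(const uint32_t *src, uint32_t *dst,
                              size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        const uint32_t *s = src + y * width;            /* source row y */
        uint32_t *d = dst + (height - y - 1) * width;   /* mirrored destination row */
        for (size_t x = 0; x < width; x++) {
            uint32_t p = s[x];
            d[x] = 0xFF000000u                    /* alpha forced to 0xFF */
                 | (p & 0x0000FF00u)              /* byte 1 stays in place */
                 | ((p << 16) & 0x00FF0000u)      /* byte 0 -> byte 2 */
                 | ((p >> 16) & 0x000000FFu);     /* byte 2 -> byte 0 */
        }
    }
}
```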

The code you provided is very strange, since it shuffles the color components not rgba->argb, but rgba->rabg.
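One way to check this is to run the question's expression on a packed test value. In the sketch below, question_shuffle is a hypothetical name and the test pixel uses R=0x11, G=0x22, B=0x33, A=0x44. Read as the number 0xRRGGBBAA, the result is 0xRRAABBGG, i.e. RABG. Read the other way, as bytes R,G,B,A loaded on a little-endian CPU (value 0xAABBGGRR), the same expression swaps R and B, leaving bytes B,G,R,A in memory.

```c
#include <stdint.h>

/* The shuffle from the question, wrapped in a function (hypothetical name). */
static uint32_t question_shuffle(uint32_t pixel)
{
    return (pixel & 0xff00ff00u)             /* keep bytes 3 and 1 */
         | ((pixel << 16) & 0x00ff0000u)     /* byte 0 -> byte 2 */
         | ((pixel >> 16) & 0x000000ffu);    /* byte 2 -> byte 0 */
}
```

question_shuffle(0x11223344u) returns 0x11443322u, and question_shuffle(0x44332211u) returns 0x44112233u.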

I've made a correct and optimized version of this routine.

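// Walk each source row forwards and write it to the mirrored destination row,
// so only the rows are flipped (as in the question), not the pixels within a row.
for (int y = 0; y < height; y++)
{
    unsigned int * rgba_ptr = rgbaBuffer + y * width;
    unsigned int * argb_ptr = argbBuffer + (height - y - 1) * width;
    unsigned int * rgba_end = rgba_ptr + width;
    for (; rgba_ptr < rgba_end; rgba_ptr++, argb_ptr++)
    {
        // *argb_ptr = *rgba_ptr >> 8 | 0xff000000;  // - this version doesn't change endianness
        *argb_ptr = __builtin_bswap32(*rgba_ptr) >> 8 | 0xff000000;  // This does (GCC/Clang builtin)
    }
}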

The first thing I did was simplify your shuffling expression: XRGB is just RGBA >> 8. I also removed the array-index calculation from each iteration and used pointers as loop variables. This version is about 2 times faster than the original on my machine.
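To make the two one-liners concrete, here they are as functions (hypothetical names) with the byte-order assumption of each spelled out. The first treats the pixel as the number 0xRRGGBBAA. The second assumes the pixel was loaded from bytes R,G,B,A on a little-endian CPU, so the value is 0xAABBGGRR and has to be byte-swapped first; __builtin_bswap32 assumes GCC or Clang. Both produce 0xFFRRGGBB, which a little-endian store lays out in memory as B,G,R,0xFF.

```c
#include <stdint.h>

/* Pixel read as the number 0xRRGGBBAA: shift out alpha, force a new one. */
static uint32_t rgba_value_to_argb(uint32_t rgba)
{
    return (rgba >> 8) | 0xff000000u;                         /* -> 0xFFRRGGBB */
}

/* Pixel loaded from bytes R,G,B,A on a little-endian CPU (value 0xAABBGGRR):
   byte-swap to 0xRRGGBBAA first, then proceed as above. */
static uint32_t rgba_bytes_le_to_argb(uint32_t loaded)
{
    return (__builtin_bswap32(loaded) >> 8) | 0xff000000u;    /* -> 0xFFRRGGBB */
}
```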

You can also use SSE for the shuffling if this code is intended for x86 CPUs.
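A sketch of what the SSE route could look like, using the SSSE3 instruction pshufb (_mm_shuffle_epi8), which reorders 16 bytes, i.e. four pixels, in one step. It assumes the goal is the same byte shuffle as the byte-wise loop in the first tip (bytes R,G,B,A become B,G,R,0xFF) and that GCC or Clang is used (for __attribute__((target)) and __builtin_cpu_supports); all function names are hypothetical. On non-x86 targets, or CPUs without SSSE3, it falls back to the scalar loop. The vertical flip is left out; it would be applied per row as in the other answers.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#endif

/* Scalar reference: bytes R,G,B,A -> B,G,R,0xFF for each pixel. */
static void swap_rb_scalar(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        dst[4 * i + 0] = src[4 * i + 2];
        dst[4 * i + 1] = src[4 * i + 1];
        dst[4 * i + 2] = src[4 * i + 0];
        dst[4 * i + 3] = 0xFF;
    }
}

#ifdef HAVE_X86
/* SSSE3 version: four pixels (16 bytes) per iteration. */
__attribute__((target("ssse3")))
static void swap_rb_ssse3(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    /* Output byte j takes input byte shuffle[j]: swaps bytes 0 and 2 of each pixel. */
    const __m128i shuffle = _mm_setr_epi8(2, 1, 0, 3, 6, 5, 4, 7,
                                          10, 9, 8, 11, 14, 13, 12, 15);
    /* 0xFF in byte 3 of every pixel, OR-ed in to force the alpha. */
    const __m128i alpha = _mm_setr_epi8(0, 0, 0, -1, 0, 0, 0, -1,
                                        0, 0, 0, -1, 0, 0, 0, -1);
    size_t i = 0;
    for (; i + 4 <= pixels; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        v = _mm_or_si128(_mm_shuffle_epi8(v, shuffle), alpha);
        _mm_storeu_si128((__m128i *)(dst + 4 * i), v);
    }
    /* Up to three leftover pixels. */
    swap_rb_scalar(src + 4 * i, dst + 4 * i, pixels - i);
}
#endif

/* Picks the SSSE3 path at run time when the CPU supports it. */
static void swap_rb(const uint8_t *src, uint8_t *dst, size_t pixels)
{
#ifdef HAVE_X86
    if (__builtin_cpu_supports("ssse3")) {
        swap_rb_ssse3(src, dst, pixels);
        return;
    }
#endif
    swap_rb_scalar(src, dst, pixels);
}
```

Comparing swap_rb against swap_rb_scalar on a buffer whose length is not a multiple of four pixels is a cheap way to check both the vector loop and the tail handling.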

I am very late to this one, but I had the exact same problem when generating video on the fly. By reusing the buffer, I could get away with setting only the R, G and B values on every frame and setting the A values only once.

See below code:

byte[] _workingBuffer = null;
byte[] GetProcessedPixelData(SKBitmap bitmap)
{
    ReadOnlySpan<byte> sourceSpan = bitmap.GetPixelSpan();

    if (_workingBuffer == null || _workingBuffer.Length != bitmap.ByteCount)
    {
        // Alloc buffer
        _workingBuffer = new byte[sourceSpan.Length];

        // Set all the alpha
        for (int i = 0; i < sourceSpan.Length; i += 4) _workingBuffer[i] = byte.MaxValue;
    }

    Stopwatch w = Stopwatch.StartNew();
    for (int i = 0; i < sourceSpan.Length; i += 4)
    {
        // A
        // Don't set alpha here. The alpha is already set in the buffer
        //_workingBuffer[i] = byte.MaxValue;
        //_workingBuffer[i] = sourceSpan[i + 3];

        // R
        _workingBuffer[i + 1] = sourceSpan[i];

        // G
        _workingBuffer[i + 2] = sourceSpan[i + 1];

        // B
        _workingBuffer[i + 3] = sourceSpan[i + 2];
    }
    Debug.Print("Copied " + sourceSpan.Length + " in " + w.Elapsed.TotalMilliseconds);

    return _workingBuffer;
}

This got me to around 15 ms on an iPhone for a 1920 * 1080 * 4 byte buffer, which is about 8 MB.

This was not nearly fast enough for me. My final solution was instead to do an offset memory copy (Buffer.BlockCopy in C#), since the alpha is not important.

    byte[] _workingBuffer = null;
    byte[] GetProcessedPixelData(SKBitmap bitmap)
    {
        ReadOnlySpan<byte> sourceSpan = bitmap.GetPixelSpan();
        byte[] sourceArray = sourceSpan.ToArray();

        if (_workingBuffer == null || _workingBuffer.Length != bitmap.ByteCount)
        {
            // Alloc buffer
            _workingBuffer = new byte[sourceSpan.Length];

            // Set first byte. This is the alpha component of the first pixel
            _workingBuffer[0] = byte.MaxValue;
        }

        // Converts RGBA to ARGB in ~2 ms instead of ~15 ms
        // 
        // Copies the whole buffer with an offset of 1
        //                                      R   G   B   A   R   G   B   A   R   G   B   A
        // Originally the source buffer has:    R1, G1, B1, A1, R2, G2, B2, A2, R3, G3, B3, A3
        //                                   A  R   G   B   A   R   G   B   A   R   G   B   A
        // After the copy it looks like:     0, R1, G1, B1, A1, R2, G2, B2, A2, R3, G3, B3, A3
        // So essentially every pixel gets the previous pixel's alpha. But all alphas should be 255 anyway.
        // The first byte is set in the alloc
        Buffer.BlockCopy(sourceArray, 0, _workingBuffer, 1, sourceSpan.Length - 1);

        // Below is an inefficient method of converting RGBA to ARGB. Takes ~15 ms on iPhone 12 Pro Max for an 8 MB buffer (1920 * 1080 * 4 bytes)
        /*
        for (int i = 0; i < sourceSpan.Length; i += 4)
        {
            // A
            // Don't set alpha here. The alpha is already set in the buffer
            //_workingBuffer[i] = byte.MaxValue;
            //_workingBuffer[i] = sourceSpan[i + 3];

            byte sR = sourceSpan[i];
            byte sG = sourceSpan[i + 1];
            byte sB = sourceSpan[i + 2];

            if (sR == 0 && sG == byte.MaxValue && sB == 0)
                continue;

            // R
            _workingBuffer[i + 1] = sR;

            // G
            _workingBuffer[i + 2] = sG;

            // B
            _workingBuffer[i + 3] = sB;
        }
        */

        return _workingBuffer;
    }

The comments in the code explain how it works. On the same iPhone it takes ~2 ms, which is sufficient for my use case.
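For readers working in C rather than C#, the same offset-copy idea maps onto memcpy. This is a sketch with a hypothetical function name, and it relies on the same assumption as the C# version: every source alpha is already 0xFF, because each output pixel actually receives the previous pixel's alpha byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset-copy trick: copy the RGBA bytes one position to the right so each
   pixel reads A,R,G,B, where A is really the previous pixel's alpha.
   n is the buffer size in bytes; dst must not overlap src. */
static void rgba_to_argb_offset_copy(const uint8_t *src, uint8_t *dst, size_t n)
{
    if (n == 0)
        return;
    dst[0] = 0xFF;                 /* alpha of the first pixel */
    memcpy(dst + 1, src, n - 1);   /* the final source alpha byte is dropped */
}
```

With input bytes 1,2,3,0xFF, 4,5,6,0xFF the output is 0xFF,1,2,3, 0xFF,4,5,6.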

Use assembly; the following is for Intel x86 (32-bit MSVC inline-assembly syntax).

This example swaps Red and Blue.

void* b = pixels;
UINT len = textureWidth*textureHeight;  // Number of 4-byte pixels

if (len != 0)  // With len == 0, "dec ecx" would wrap around and the loop would run ~4 billion times
{
    __asm
    {
        mov ecx, len            // Loop counter = number of pixels
        mov ebx, b              // ebx = pointer to the current pixel
    swap_loop:
        mov al, [ebx+0]         // Load Red into al
        mov ah, [ebx+2]         // Load Blue into ah
        mov [ebx+0], ah         // Store Blue where Red was
        mov [ebx+2], al         // Store Red where Blue was
        add ebx, 4              // Move 4 bytes on to the next pixel
        dec ecx                 // One pixel done
        jnz swap_loop           // Repeat until the counter reaches zero
    }
}
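For comparison, here is the same in-place Red/Blue swap in portable C. It is a sketch with a hypothetical function name; unlike MSVC's __asm blocks, which are not available on x64, it compiles on any target. An optimizing compiler may well produce code comparable to the hand-written loop, but that is something to measure rather than assume.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap bytes 0 and 2 (Red and Blue) of every 4-byte pixel, in place. */
static void swap_red_blue_in_place(uint8_t *pixels, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint8_t *p = pixels + 4 * i;   /* start of pixel i */
        uint8_t red = p[0];
        p[0] = p[2];                   /* Blue goes where Red was */
        p[2] = red;                    /* Red goes where Blue was */
    }
}
```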

Licensed under: CC-BY-SA with attribution
