I have to pass medical image data retrieved from one proprietary device SDK to an image processing function in a second, also proprietary, device SDK from a different vendor.
The first function gives me the image in a planar rgb format:
int mrcpgk_retrieve_frame(uint16_t *r, uint16_t *g, uint16_t *b, int w, int h);
The reason for uint16_t is that the device can be switched to output each color value as a 16-bit floating point value. However, I'm operating in "byte mode", so the upper 8 bits of each color value are always zero.
The second function from another device SDK is defined like this:
BOOL process_cpgk_image(const PBYTE rgba, DWORD width, DWORD height);
So we get three buffers filled with the following bits (16-bit planar RGB):
R: 00000000 rrrrrrrr 00000000 rrrrrrrr ...
G: 00000000 gggggggg 00000000 gggggggg ...
B: 00000000 bbbbbbbb 00000000 bbbbbbbb ...
And the desired output illustrated in bits is:
RGBA: rrrrrrrrggggggggbbbbbbbb00000000 rrrrrrrrggggggggbbbbbbbb00000000 ....
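To make the target layout concrete: the bits above are in memory order, so on a little-endian x86 machine one output pixel is the 32-bit word `r | g << 8 | b << 16` with a zero top byte. The following is only a small sketch of that mapping; the helper name `pack_pixel_le` is made up for illustration and is not part of either SDK.

```c
#include <stdint.h>

/* Hypothetical helper: pack one pixel into a 32-bit word whose bytes,
   when stored on a little-endian CPU, appear in memory as R, G, B, 0. */
static uint32_t pack_pixel_le(uint16_t r, uint16_t g, uint16_t b)
{
    return  (uint32_t)(r & 0xFF)          /* byte 0: red   */
         | ((uint32_t)(g & 0xFF) << 8)    /* byte 1: green */
         | ((uint32_t)(b & 0xFF) << 16);  /* byte 2: blue, byte 3 stays 0 */
}
```

Storing such a word per pixel gives the same bytes as the per-channel writes in the bridge code, as long as the target is little-endian.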
We don't have access to the source code of these functions and cannot change the environment. Currently we have implemented the following basic "bridge" to connect the two devices:
void process_frames(int width, int height)
{
    /* One 16-bit plane per channel, plus the packed 8-bit RGBA output. */
    uint16_t *r = (uint16_t*)malloc(width*height*sizeof(uint16_t));
    uint16_t *g = (uint16_t*)malloc(width*height*sizeof(uint16_t));
    uint16_t *b = (uint16_t*)malloc(width*height*sizeof(uint16_t));
    uint8_t *rgba = (uint8_t*)malloc(width*height*4);
    int i;

    /* The alpha byte is never written in the loop, so clear it once up front. */
    memset(rgba, 0, width*height*4);

    while ( mrcpgk_retrieve_frame(r, g, b, width, height) != 0 )
    {
        /* Interleave the three planes, keeping only the low byte of each sample. */
        for (i=0; i<width*height; i++)
        {
            rgba[4*i+0] = (uint8_t)r[i];
            rgba[4*i+1] = (uint8_t)g[i];
            rgba[4*i+2] = (uint8_t)b[i];
        }
        process_cpgk_image(rgba, width, height);
    }

    free(r);
    free(g);
    free(b);
    free(rgba);
}
This code works perfectly fine, but processing takes very long for many thousands of high-resolution images. The two functions for retrieving and processing are very fast, and our bridge is currently the bottleneck.
I know how to do basic arithmetic, logical and shifting operations with SSE2 intrinsics. Can this 16-bit planar RGB to packed RGBA conversion be accelerated with MMX, SSE2 or [S]SSE3, and if so, how?
(SSE2 would be preferable because there are still some pre-2005 appliances in use).
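For reference, here is a rough sketch of one SSE2-only direction, not a tuned or benchmarked solution. It relies on the byte-mode assumption stated earlier (the upper 8 bits of every sample are zero) and on little-endian byte order. The function name `planar16_to_rgba8_sse2` is made up for illustration; the intrinsics are the standard SSE2 ones from `<emmintrin.h>`. Eight pixels are handled per iteration: green is shifted into the high byte of each 16-bit lane and ORed with red, then those lanes are interleaved with the blue lanes, whose high byte already supplies the zero alpha.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sketch: convert n pixels of 16-bit planar RGB (byte mode, upper 8 bits
   assumed zero) into packed 8-bit R, G, B, 0. Hypothetical helper name. */
static void planar16_to_rgba8_sse2(const uint16_t *r, const uint16_t *g,
                                   const uint16_t *b, uint8_t *rgba, size_t n)
{
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        /* Unaligned loads: 8 samples per plane, each lane is 0x00vv. */
        __m128i vr = _mm_loadu_si128((const __m128i *)(r + i));
        __m128i vg = _mm_loadu_si128((const __m128i *)(g + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));

        /* Each 16-bit lane becomes (g << 8) | r, i.e. bytes R, G in memory. */
        __m128i rg = _mm_or_si128(vr, _mm_slli_epi16(vg, 8));

        /* Interleave 16-bit lanes: rg0, b0, rg1, b1, ... which is
           R G B 0 per pixel, because the high byte of each b lane is zero. */
        __m128i lo = _mm_unpacklo_epi16(rg, vb);  /* pixels i .. i+3   */
        __m128i hi = _mm_unpackhi_epi16(rg, vb);  /* pixels i+4 .. i+7 */

        _mm_storeu_si128((__m128i *)(rgba + 4 * i), lo);
        _mm_storeu_si128((__m128i *)(rgba + 4 * i + 16), hi);
    }

    /* Scalar tail for the last n % 8 pixels. */
    for (; i < n; i++) {
        rgba[4 * i + 0] = (uint8_t)r[i];
        rgba[4 * i + 1] = (uint8_t)g[i];
        rgba[4 * i + 2] = (uint8_t)b[i];
        rgba[4 * i + 3] = 0;
    }
}
```

In the bridge above, the inner `for` loop would be replaced by a call such as `planar16_to_rgba8_sse2(r, g, b, rgba, (size_t)width * height);`. Because this version writes the alpha byte itself, the initial `memset` would no longer be needed. If the buffers can be guaranteed to be 16-byte aligned, the aligned load and store variants might help on older CPUs, but that is untested here.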