Question

I need to blend thousands of pairs of images very fast.

My code currently does the following: _apply is a function pointer to a function like Blend below. It is just one of many functions we can pass in; each one takes two values and produces a third, and it is applied to each channel of each pixel. I would prefer a solution that is general to any such function rather than one specific to blending.

typedef byte (*Transform)(byte src1, byte src2);
Transform _apply;

for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}


byte Blend(byte src, byte blend)
{
    int resultPixel = (src + blend)/2;

    return (byte)resultPixel;
}

I was doing this on the CPU, but the performance is terrible. It is my understanding that doing this on the GPU is very fast. My program needs to run on computers that will have either Nvidia or Intel GPUs, so whatever solution I use needs to be vendor independent. If I use the GPU, it has to be OpenGL so it stays platform independent as well.

I think using a GLSL pixel shader would help, but I am not familiar with pixel shaders or how to apply them to 2D objects (like my images).

Is that a reasonable solution? If so, how do I do this in 2D? If there is a library that already does this, that would also be great to know.

EDIT: I am receiving the image pairs from different sources. One always comes from a 3D graphics component in OpenGL (so it is originally in GPU memory). The other one comes from system memory, either from a socket (as a compressed video stream) or from a memory-mapped file. The "sink" of the resulting image is the screen: I am expected to show the images on the screen, so going to the GPU is an option, as is using something like SDL to display them.

The blend function that is going to be executed the most is this one

byte Patch(byte delta, byte lo)
{
    int resultPixel = (2 * (delta - 127)) + lo;

    if (resultPixel > 255)
       resultPixel = 255;

    if (resultPixel < 0)
       resultPixel = 0;

    return (byte)resultPixel;
}

EDIT 2: The image coming from GPU land arrives in this fashion, from FBO to PBO to system memory:

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
// with a pack PBO bound, glReadPixels writes into the PBO instead of client memory
glReadPixels(0, 0, width, height, GL_BGR, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
// map the PBO to get a pointer to the pixels in system memory
void* mappedRegion = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);

It seems it is probably better to just keep everything in GPU memory. The other bitmap can come from system memory; we may eventually get it from a video decoder in GPU memory as well.

EDIT 3: One of my images will come from D3D while the other one comes from OpenGL. It seems that something like Thrust or OpenCL is the best option.


Solution

From the looks of your Blend function, this is an entirely memory-bound operation. The caches on the CPU can likely hold only a very small fraction of the thousands of images you have, which means most of your time is spent waiting for RAM to fulfill load/store requests while the CPU sits idle.

You will NOT get any speedup by copying your images from RAM to the GPU, having the GPU's arithmetic units idle while they wait for GPU RAM to feed them data, waiting for GPU RAM again while the results are written, and then copying everything back to main RAM. Using the GPU for this could actually slow things down substantially.


But I could be wrong, and you might not be saturating your memory bus already; you will have to try it on your system and profile it.
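
A minimal timing sketch for that profiling step (C++11, using the loop and variable names from the question) could look like this:

#include <chrono>
#include <cstdio>

// time one pass of the blend loop; the loop reads source and blend and
// writes source, i.e. roughly 3 bytes of memory traffic per index
auto start = std::chrono::high_resolution_clock::now();

for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}

auto stop = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(stop - start).count();
printf("blend: %.3f ms, ~%.1f MB/s of memory traffic\n",
       ms, (3.0 * _frameSize / (1024.0 * 1024.0)) / (ms / 1000.0));

With a baseline number in hand, here are some simple things you can try to optimize.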

1. Multi-thread

I would focus on optimizing the algorithm directly on the CPU. The simplest thing is to go multi-threaded, which can be as simple as enabling OpenMP in your compiler and updating your for loop:

#include <omp.h> // add this along with enabling OpenMP support in your compiler
...
#pragma omp parallel for // <--- compiler magic happens here
for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}

If your memory bandwidth is not already saturated, this will likely speed up the blending by roughly the number of cores your system has.

2. Micro-optimizations

Another thing you can try is to implement your Blend using the SIMD instructions that most CPUs have nowadays. I can't help you with that without knowing which CPU you are targeting.
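
As an illustration only: on x86 the plain averaging blend maps almost directly onto SSE2. This is just a sketch, assuming byte is an unsigned char and an x86/x64 target (the BlendSSE2 name is mine); other transforms would need different intrinsics.

#include <emmintrin.h> // SSE2 intrinsics

// averages two byte buffers, 16 bytes at a time
// note: _mm_avg_epu8 computes (a + b + 1) >> 1, i.e. it rounds up,
// which differs from (a + b)/2 by at most one level
void BlendSSE2(const byte* src, const byte* blend, byte* result, int frameSize)
{
    int i = 0;
    for (; i + 16 <= frameSize; i += 16)
    {
        __m128i a = _mm_loadu_si128((const __m128i*)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(blend + i));
        _mm_storeu_si128((__m128i*)(result + i), _mm_avg_epu8(a, b));
    }
    for (; i < frameSize; i++)           // scalar tail
        result[i] = (byte)((src[i] + blend[i]) / 2);
}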

You can also try unrolling your for loop to mitigate some of the loop overhead.
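
As a sketch, a manually unrolled version of the original loop would look something like this (a good optimizing compiler often does this on its own):

int i = 0;
for (; i + 4 <= _frameSize; i += 4)   // process four bytes per iteration
{
    source[i]     = _apply(source[i],     blend[i]);
    source[i + 1] = _apply(source[i + 1], blend[i + 1]);
    source[i + 2] = _apply(source[i + 2], blend[i + 2]);
    source[i + 3] = _apply(source[i + 3], blend[i + 3]);
}
for (; i < _frameSize; i++)           // handle the remainder
    source[i] = _apply(source[i], blend[i]);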

One easy way to achieve both of these is to leverage the Eigen matrix library by wrapping your data in its data structures.

#include <Eigen/Dense>
using namespace Eigen;

// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = ...

// tell Eigen where your data/buffers are, and to treat them like dynamic vectors of bytes
// this is a cheap shallow copy: no pixel data is moved
Map<Matrix<byte, Dynamic, 1> > sourceMap(source, _frameSize);
Map<Matrix<byte, Dynamic, 1> > blendMap(blend, _frameSize);
Map<Matrix<byte, Dynamic, 1> > resultMap(result, _frameSize);

// perform the blend using all manner of insane optimization voodoo under the covers
// (cast to int first so the byte addition cannot overflow)
resultMap = ((sourceMap.cast<int>() + blendMap.cast<int>()) / 2).cast<byte>();
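
Your Patch function from the edit can be written the same way; this is only a sketch, assuming deltaMap and loMap are Maps set up like the ones above:

// resultMap[i] = clamp(2 * (delta[i] - 127) + lo[i], 0, 255), element-wise
resultMap = (2 * (deltaMap.cast<int>().array() - 127) + loMap.cast<int>().array())
                .max(0).min(255).cast<byte>().matrix();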

3. Use GPGPU

Finally, a direct answer to your question: an easy way to leverage the GPU without having to know much about GPU programming is the Thrust library. You will have to rewrite your algorithm as an STL-style algorithm, but that is pretty easy in your case.

#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/for_each.h>
#include <thrust/copy.h>

// functor for blending
struct blend_functor
{
  template <typename Tuple>
  __host__ __device__
  void operator()(Tuple t)
  {
    // C[i] = (A[i] + B[i])/2;
    thrust::get<2>(t) = (thrust::get<0>(t) + thrust::get<1>(t))/2;
  }
};

// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = new byte[_frameSize]; // buffer in main RAM for the blended output

// copy the data to the vectors on the GPU
thrust::device_vector<byte> A(source, source + _frameSize);
thrust::device_vector<byte> B(blend, blend + _frameSize);
// allocate result vector on the GPU
thrust::device_vector<byte> C(_frameSize);

// process the data on the GPU device
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(
                                  A.begin(), B.begin(), C.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(
                                  A.end(), B.end(), C.end())),
                 blend_functor());

// copy the data back to the result buffer in main RAM
thrust::copy(C.begin(), C.end(), result);
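
For the Patch function from your edit, only the functor changes. A sketch, reusing the same zip-iterator setup with the delta and lo images as the first two vectors:

// Patch: C[i] = clamp(2 * (delta[i] - 127) + lo[i], 0, 255)
struct patch_functor
{
  template <typename Tuple>
  __host__ __device__
  void operator()(Tuple t)
  {
    int r = 2 * ((int)thrust::get<0>(t) - 127) + (int)thrust::get<1>(t);
    if (r > 255) r = 255;
    if (r < 0)   r = 0;
    thrust::get<2>(t) = (byte)r;
  }
};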

A really neat thing about Thrust is that once you have written your algorithm in this generic way, it can automagically use different back ends for the computation. CUDA is the default back end, but you can also configure it at compile time to use OpenMP or TBB (Intel's threading library).
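
For example, assuming a Thrust version that honors the THRUST_DEVICE_SYSTEM macro, switching to the OpenMP back end is purely a compile-time change and needs no CUDA toolchain:

// a sketch: pick the OpenMP back end by defining this before any Thrust header
// (equivalent to passing -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP to the
//  compiler; the program must then be built with OpenMP enabled, e.g. -fopenmp)
#define THRUST_DEVICE_SYSTEM THRUST_DEVICE_SYSTEM_OMP
#include <thrust/device_vector.h>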

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow