Operations on 2 images (addition, subtraction, etc.) without using a buffer [closed]

Question 1

The code you shared has many problems, and in the comment section you state that it doesn't even work. I think you should focus on solving one problem at a time: get the code working first, and only then try to make it faster.

Your application retrieves the width from one image and the height from the other. This rarely leads to good things.

    uint32_t width = im1.GetWidth();
    uint32_t height = im2.GetHeight();

Alright, so buffer1 points to the pixel data of im1, and p1 is just a copy of buffer1. I think you don't really need p1, just use buffer1 instead.

    uint8_t* buffer1 = static_cast<uint8_t*>( im1.GetBuffer());
    uint8_t* p1 = buffer1;

And now buffer2 and p2 also point to im1. What?! Shouldn't it be im2? You don't really need p2 either.

    uint8_t* buffer2 = static_cast<uint8_t*>( im1.GetBuffer());
    uint8_t* p2 = buffer2;


    for (uint32_t y = 0; y < height; ++y)
    {

The next loop increments p, a variable that was never declared. I suppose you meant to increment p1.

        for (uint32_t x = 0; x < width; ++x, ++p)
        {
            *p2 = (uint8_t)*p1+*p2;
            ++p2;
        }
    }

Right now it doesn't make sense to display im2, since the code never modifies it: both pointers refer to im1.

    ShowImage( im2, "Mixed image");

One more thing: if im1 and im2 have different sizes, the loop can read or write past the end of the smaller buffer, which could lead to a crash.
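To make the points above concrete, here is a minimal sketch of the loop with those problems fixed. It works on raw buffers so it does not depend on any camera SDK; `AddInto` and its parameters are hypothetical names, and it assumes 8-bit single-channel images stored without row padding. The sum still wraps around modulo 256, exactly like the original expression, so this only addresses the indexing and size issues.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of any SDK): adds src into dst, pixel by pixel.
// Assumes 8-bit single-channel data with no padding between rows.
// Returns false and leaves dst untouched if the dimensions differ.
bool AddInto(const std::uint8_t* src, std::uint32_t srcWidth, std::uint32_t srcHeight,
             std::uint8_t* dst, std::uint32_t dstWidth, std::uint32_t dstHeight)
{
    assert(src != nullptr && dst != nullptr);

    // Width and height must come from the same image, and both images must match.
    if (srcWidth != dstWidth || srcHeight != dstHeight)
        return false;

    const std::size_t count = static_cast<std::size_t>(srcWidth) * srcHeight;
    for (std::size_t i = 0; i < count; ++i)
    {
        // One index for both buffers; the result wraps modulo 256 on overflow.
        dst[i] = static_cast<std::uint8_t>(src[i] + dst[i]);
    }
    return true;
}
```

With the images from the question you would presumably pass `static_cast<const uint8_t*>(im1.GetBuffer())` with `im1.GetWidth()` and `im1.GetHeight()` as the source, and the same three values taken from im2 as the destination.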

I strongly suggest you take a look at the following post to know how to ask better questions and get people to help you: Short, Self Contained, Correct (Compilable), Example

There are a few technologies that can speed up the processing of those arithmetic operations:

If you have an Intel CPU: Intel® Threading Building Blocks (Intel® TBB);
If you have an Intel CPU: Intel® Integrated Performance Primitives (Intel® IPP);
If you have a GPU that supports OpenGL, you can write your own GLSL shader;
If you have a GPU that supports DirectX, you can write your own HLSL shader;
If you have an NVIDIA GPU: CUDA™;
If you have an NVIDIA/ATI GPU: OpenCL;
You can try Eigen, a C++ template library for linear algebra (performs optimized operations on matrices);
OpenMP® (a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs);
Last but not least, you can always write your own assembly code to perform the arithmetic operations.
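As an illustration of the OpenMP option from the list, here is a rough sketch, not a tuned implementation. `AverageIntoParallel` is a hypothetical name, and the function assumes two non-overlapping 8-bit buffers of equal length. It stores the average of each pixel pair so the result always fits in a byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: stores the average of src[i] and dst[i] in dst[i].
// When OpenMP is enabled (-fopenmp for GCC/Clang, /openmp for MSVC) the
// iterations are split across threads; otherwise the pragma is ignored and
// the loop runs serially with the same result.
void AverageIntoParallel(const std::uint8_t* src, std::uint8_t* dst, std::ptrdiff_t count)
{
    assert(src != nullptr && dst != nullptr && count >= 0);

    // A signed loop index keeps this compatible with older OpenMP versions.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < count; ++i)
    {
        dst[i] = static_cast<std::uint8_t>((src[i] + dst[i]) >> 1);
    }
}
```

Each iteration touches only its own element, so no synchronisation is needed. Whether the threads actually help depends on the image size, since a loop this simple is usually limited by memory bandwidth.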

Question 2

Before you start optimising, make sure that your output is correct!

The expression

*p2 = (uint8_t)*p1+*p2;

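will overflow and give you wrong results. The cast (uint8_t) binds more tightly than +, so it applies only to *p1, which is already a uint8_t; it does nothing. The addition itself is performed in int, and assigning the result back to a uint8_t truncates it modulo 256, so any sum above 255 wraps around instead of being clipped to a valid range. Do the addition in a wider type and clamp the result: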

const uint16_t a = *p1;
const uint16_t b = *p2;
const uint16_t sum = a+b;
*p2 = static_cast<uint8_t>( sum > 255 ? 255 : sum );

Better yet, add the two values and divide by two: the result always stays in a valid range, you lose only the LSB, and the code is branchless.

*p2 = static_cast<uint8_t>( sum >> 1 );
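To see the difference between the three behaviours side by side, here is a small sketch with hypothetical helper names (`AddWrap`, `AddSaturate`, `AddAverage`); the checks at the end are evaluated at compile time.

```cpp
#include <cstdint>

// Wrap-around: the int sum is truncated to 8 bits (modulo 256).
constexpr std::uint8_t AddWrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Saturating: sums above 255 are clamped to 255.
constexpr std::uint8_t AddSaturate(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b > 255 ? 255 : a + b);
}

// Averaging: (a + b) / 2 rounded down; always fits in 8 bits.
constexpr std::uint8_t AddAverage(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>((a + b) >> 1);
}

static_assert(AddWrap(200, 100) == 44, "300 wraps to 300 - 256");
static_assert(AddSaturate(200, 100) == 255, "300 is clamped to 255");
static_assert(AddAverage(200, 100) == 150, "300 / 2");
```

Note that averaging halves the overall brightness compared with a true sum, so it is only a drop-in replacement if that is acceptable for your images.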

Some more tips you could try before you have to use a different technique.

Use a compiler that supports auto-vectorisation (VC >= 2012, GCC >= 4.7) and turn it on.
If you are compiling for 32-bit Windows, use "/arch:SSE2".
Give the compiler hints by using const and restrict (restrict is not a standard C++ keyword; MSVC and GCC spell it __restrict).
If you are sure that the image size is always the same, use a fixed width and height.

e.g.

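#include <stdint.h> // uint8_t, uint16_t

void add( const CPylonImage& im1, CPylonImage& im2 )
{
    // Fixed size so the compiler knows the trip count; only valid if every image has this size.
    const int w = 1294; // im1.GetWidth();
    const int h = 964;  // im1.GetHeight();

    // __restrict (MSVC, GCC) promises the compiler that the two buffers do not overlap.
    const uint8_t* __restrict buffer1 = static_cast<const uint8_t*>( im1.GetBuffer() );
    uint8_t* __restrict buffer2 = static_cast<uint8_t*>( im2.GetBuffer() );
    for( int i = 0; i < w*h; i++ )
    {
        const uint16_t a = buffer1[i];
        const uint16_t b = buffer2[i];
        const uint16_t sum = ( a + b ) >> 1; // average of the two pixels; always fits in 8 bits
        buffer2[i] = static_cast<uint8_t>( sum );
    }
}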