I have an algorithm which converts a Bayer image channel to RGB. In my implementation I have a single nested for loop which iterates over the Bayer channel, calculates the RGB index from the Bayer index and then sets that pixel's value from the Bayer channel. The main thing to notice here is that each pixel can be calculated independently of the other pixels (it doesn't rely on previous calculations), so the algorithm is a natural candidate for parallelization. The calculation does, however, rely on some preset arrays which all threads will access at the same time but will not change.

However, when I tried parallelizing the main for loop with MS's concurrency::parallel_for, I gained no boost in performance. In fact, for an input of size 3264x2540 running on a 4-core CPU, the non-parallelized version ran in ~34 ms and the parallelized version ran in ~69 ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).

Using Intel's compiler with tbb::parallel_for gave nearly identical results. For comparison, I started out with this algorithm implemented in C#, in which I also used parallel_for loops, and there I saw nearly 4x performance gains (I opted for C++ because for this particular task C++ was faster even on a single core).

Any ideas what is preventing my code from parallelizing well?

My code:

template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace colorSpace)
{
        //Translates index offset in Bayer image to channel offset in RGB image
        int offsets[4];
        //calculate offsets according to color space
        switch (colorSpace)
        {
        case ColorSpace::BGGR:
            offsets[0] = 2;
            offsets[1] = 1;
            offsets[2] = 1;
            offsets[3] = 0;
            break;
        // ...other color spaces
        }
        memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        parallel_for(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row%2)*2 + (col%2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
}

Solution

First of all, your algorithm is memory-bandwidth bound; that is, the memory loads/stores outweigh any index calculations you do.

Vector operations like SSE/AVX would not help either - you are not doing any computation-intensive calculations.

Increasing the amount of work per iteration is also useless - both PPL and TBB are smart enough not to create a thread per iteration; they use a good partitioner, which additionally tries to preserve locality. For instance, here is a quote from the tbb::parallel_for documentation:

When worker threads are available, parallel_for executes iterations in non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.
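
In code, those "consecutive runs of values" correspond to the blocked_range overload of tbb::parallel_for. A minimal sketch (mine, not part of the original answer; ParallelRows and perRow are illustrative names):

#include <tbb/tbb.h>

// Sketch: each task receives one consecutive run of rows chosen by the partitioner.
template<typename F>
void ParallelRows(int height, F perRow)
{
    tbb::parallel_for(tbb::blocked_range<int>(0, height),
        [&](const tbb::blocked_range<int>& r)
        {
            // r.begin()..r.end() is a contiguous chunk, preserving row locality
            for (int row = r.begin(); row != r.end(); ++row)
                perRow(row);
        });
}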

What really matters is to reduce memory operations. Any superfluous traversal over the input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.
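
For example, one way to act on the memset half of that advice (a sketch of the idea applied to the question's loop, not the poster's code): write all three components of each output pixel inside the main loop, so every output byte is touched exactly once and the upfront memset becomes unnecessary:

parallel_for(0, Height, [&] (int row)
{
    for (int col = 0, bayerIndex = row * Width; col < Width; ++col, ++bayerIndex)
    {
        int offset = (row % 2) * 2 + (col % 2); // 0..3
        int rgbIndex = bayerIndex * 3;
        RgbChannel[rgbIndex + 0] = 0;           // clear all three components here
        RgbChannel[rgbIndex + 1] = 0;           // instead of in a separate memset pass
        RgbChannel[rgbIndex + 2] = 0;
        RgbChannel[rgbIndex + offsets[offset]] = BayerChannel[bayerIndex];
    }
});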

You are fully traversing both the input and the output data. Even if you skip something in the output, that doesn't matter, because memory operations happen in 64-byte cache-line chunks on modern hardware. So, calculate the size of your input and output, measure the time of the algorithm, divide size by time, and compare the result with the maximal characteristics of your system (for instance, as measured with a memory benchmark).
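
As a concrete sketch of that arithmetic (mine; it assumes the float buffers from the question):

#include <cstdio>

// Sketch: achieved bandwidth = bytes moved / elapsed seconds; compare the
// result against the peak your machine reaches in a STREAM-style benchmark.
void ReportBandwidth(int width, int height, double seconds)
{
    const double bytesRead    = 1.0 * width * height * sizeof(float); // Bayer input
    const double bytesWritten = 3.0 * width * height * sizeof(float); // RGB output
    std::printf("achieved: %.2f GB/s\n", (bytesRead + bytesWritten) / seconds / 1e9);
}

If a single-threaded run already lands near that peak, additional threads have nothing left to gain.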

I have made a test for Microsoft PPL, OpenMP and a native for loop. The results (I used 8x your height) are:

Native_For       0.21 s
OpenMP_For       0.15 s
Intel_TBB_For    0.15 s
MS_PPL_For       0.15 s

If the memset is removed:

Native_For       0.15 s
OpenMP_For       0.09 s
Intel_TBB_For    0.09 s
MS_PPL_For       0.09 s

As you can see, the memset (which is highly optimized) is responsible for a significant share of the execution time, which shows how memory bound your algorithm is.

FULL SOURCE CODE:

#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>

using namespace boost;
using namespace std;

const auto Width = 3264;
const auto Height = 2540*8;

struct MS_PPL_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        concurrency::parallel_for(first,last,f);
    }
};

struct Intel_TBB_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        tbb::parallel_for(first,last,f);
    }
};

struct Native_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        for(; first!=last; ++first) f(first);
    }
};

struct OpenMP_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        #pragma omp parallel for
        for(auto i=first; i<last; ++i) f(i);
    }
};

template<typename T>
struct ConvertBayerToRgbImageAsIs
{
    const T* BayerChannel;
    T* RgbChannel;
    template<typename For>
    void operator()(For for_)
    {
        cout << type_name<For>() << "\t";
        progress_timer t;
        int offsets[] = {2,1,1,0};
        //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        for_(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row % 2)*2 + (col % 2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
    }
};

int main()
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
    for(auto i=0;i!=4;++i)
    {
        mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
        cout << string(16,'_') << endl;
    }
}

Other tips

Synchronization overhead

I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and run the computation in parallel, you would have noticed a large gain. Try to design the loop in a way that causes fewer iterations and more work per iteration. The reasoning behind this is that too much synchronization is done.
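
A sketch of that shape (my illustration; ForEachRowBand is a made-up helper): split the image into hardware_concurrency()-many horizontal bands, so each parallel_for iteration carries a whole band of rows instead of a single row:

#include <ppl.h>
#include <thread>
#include <algorithm>

// Sketch: one coarse task per band of rows; synchronization happens a handful
// of times instead of Height times.
template<typename Body>
void ForEachRowBand(int height, Body body)
{
    const int bands = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    concurrency::parallel_for(0, bands, [=](int band)
    {
        const int begin = band * height / bands;
        const int end   = (band + 1) * height / bands;
        for (int row = begin; row < end; ++row)
            body(row); // per-row work, e.g. the Bayer-to-RGB conversion
    });
}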

Cache usage

An important factor may be how the data is split (partitioned) for processing. If the processed rows are separated as in the bad case below, then more rows will cause cache misses. This effect becomes more important with each additional thread, because the distance between rows will be greater. If you are certain that the parallelizing function performs a reasonable partitioning, then manual work-splitting will not bring any gains.

 bad       good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2

Also make sure that you access your data the same way it is aligned; it is possible that each access to offsets[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive: almost every operation either reads from or writes to memory. Preventing cache misses and minimizing memory accesses is crucial.

Code optimizations

The optimizations shown below may already be done by the compiler and may not give better results. They are worth knowing about nonetheless.

    // is the memset really necessary?
    //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
    parallel_for(0, Height, [&] (int row)
    {
        int rowMod = (row & 1) << 1;
        for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
        {
            auto rgbIndex = tripleBayerIndex + offsets[rowMod];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex];

            //unrolled the loop to save col & 1 operation
            rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
        }
    });

Here comes my suggestion:

  1. Compute larger chunks in parallel
  2. Get rid of the modulo/multiplication
  3. Unroll the inner loop to compute one full pixel (simplifies the code)

    template<typename T> void static ConvertBayerToRgbImageAsIsNew(T* BayerChannel, T* RgbChannel, int Width, int Height)
    {
        // convert BGGR->RGB
        // have as many threads as the hardware concurrency is
        const auto chunk = static_cast<int>(Height / thread::hardware_concurrency());
        parallel_for(0, Height, chunk, [&] (int stride)
        {
            for (auto row = stride; row < stride + chunk; row++) // each task handles one chunk of rows
            {
                for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
                {
                    RgbChannel[rgbCol+0]  = BayerChannel[col+3];
                    RgbChannel[rgbCol+1]  = BayerChannel[col+1];
                    // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
                    RgbChannel[rgbCol+2]  = BayerChannel[col+0];
                }
            }
        });
    }
    

This code is 60% faster than the original version, but still only half as fast as the non-parallelized version on my laptop. This seemed to be due to the memory-boundedness of the algorithm, as others have pointed out already.

Edit: but I was not happy with that. I was able to greatly improve the parallel performance by going from parallel_for to std::async:

int hc = thread::hardware_concurrency();
future<void>* res = new future<void>[hc];
for (int i = 0; i<hc; ++i)
{
    int endRow = (i + 1 == hc) ? rows : rows / hc * (i + 1); // last chunk picks up the remainder
    res[i] = async(launch::async, Converter<char>(bayerChannel, rgbChannel, rows, cols, rows / hc * i, endRow)); // launch::async avoids a deferred (serial) run
}
for (int i = 0; i<hc; ++i)
{
    res[i].wait();
}
delete [] res;

with Converter being a simple class:

template <class T> class Converter
{
public:
Converter(T* BayerChannel, T* RgbChannel, int Width, int Height, int startRow, int endRow) : 
    BayerChannel(BayerChannel), RgbChannel(RgbChannel), Width(Width), Height(Height), startRow(startRow), endRow(endRow)
{
}
void operator()()
{
    // convert BGGR->RGB
    for(int row = startRow; row < endRow; row++)
    {
        for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
        {
            RgbChannel[rgbCol+0]  = BayerChannel[col+3];
            RgbChannel[rgbCol+1]  = BayerChannel[col+1];
            // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
            RgbChannel[rgbCol+2]  = BayerChannel[col+0];
        }
    };
}
private:
T* BayerChannel;
T* RgbChannel;
int Width;
int Height;
int startRow;
int endRow;
};

This is now 3.5 times faster than the non parallelized version. From what I have seen in the profiler so far, I assume that the work stealing approach of parallel_for incurs a lot of waiting and synchronization overhead.

I have not used tbb::parallel_for nor concurrency::parallel_for, but if your numbers are correct they seem to carry too much overhead. However, I strongly advise you to run more than 10 iterations when testing, and also to do enough warm-up iterations before timing.

I tested your exact code using three different methods, averaged over 1000 runs:

Serial:      14.6 ± 1.0  ms
std::async:  13.6 ± 1.6  ms
workers:     11.8 ± 1.2  ms

The first is the serial calculation. The second is done using four calls to std::async. The last is done by sending four jobs to four already started (but sleeping) background threads; a sketch of that setup follows the parallel-for helper below.

The gains aren't big, but at least they are gains. I did the test on a 2012 MacBook Pro with a dual-core hyper-threaded CPU, i.e. 4 logical cores.

For reference, here's my std::async parallel for:

#include <future> // std::async, std::future
#include <thread> // std::thread::hardware_concurrency
#include <vector>

template<typename Int=int, class Fun>
void std_par_for(Int beg, Int end, const Fun& fun)
{
    auto N = std::thread::hardware_concurrency();
    std::vector<std::future<void>> futures;

    for (Int ti=0; ti<N; ++ti) {
        Int b = ti * (end - beg) / N;
        Int e = (ti+1) * (end - beg) / N;
        if (ti == N-1) { e = end; }

        // std::launch::async forces a new thread; the default policy may run deferred
        futures.emplace_back( std::async(std::launch::async, [&,b,e]() {
            for (Int ix=b; ix<e; ++ix) {
                fun( ix );
            }
        }));
    }

    for (auto&& f : futures) {
        f.wait();
    }
}
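
The "workers" variant is not shown above; a bare-bones sketch of the idea (mine, not the answerer's exact code) keeps the threads alive and asleep on a condition variable, so the per-run cost is a wakeup rather than thread creation:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Sketch: pre-started threads sleep until a job arrives.
class SleepingWorkers
{
public:
    explicit SleepingWorkers(unsigned n)
    {
        for (unsigned i = 0; i < n; ++i)
            threads.emplace_back([this] { Loop(); });
    }
    ~SleepingWorkers()
    {
        { std::lock_guard<std::mutex> lock(m); stop = true; }
        cv.notify_all();
        for (auto& t : threads) t.join();
    }
    void Submit(std::function<void()> job)
    {
        { std::lock_guard<std::mutex> lock(m); jobs.push_back(std::move(job)); }
        cv.notify_one();
    }
private:
    void Loop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return stop || !jobs.empty(); });
                if (stop && jobs.empty()) return;
                job = std::move(jobs.back());
                jobs.pop_back();
            }
            job(); // run outside the lock
        }
    }
    std::vector<std::thread> threads;
    std::vector<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable cv;
    bool stop = false;
};

Completion signalling (e.g. a per-batch counter guarded by the same mutex) is omitted to keep the sketch short.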

Things to check or do

  • Are you using a Core 2 or older processor? They have a very narrow memory bus that's easy to saturate with code like this. In contrast, 4-channel Sandy Bridge-E processors require multiple threads to saturate the memory bus (it's not possible for a single memory-bound thread to fully saturate it).
  • Have you populated all of your memory channels? E.g. if you have a dual-channel CPU but have just one RAM module installed, or two that are on the same channel, you're getting half the available bandwidth.
  • How are you timing your code?
    • The timing should be done inside the application like Evgeny Panasyuk suggests.
    • You should do multiple runs within the same application. Otherwise, you may be timing one-time startup code to launch the thread pools, etc.
  • Remove the superfluous memset, as others have explained.
  • As ogni42 and others have suggested, unroll your inner loop (I didn't bother checking the correctness of that solution, but if it's wrong, you should be able to fix it). This is orthogonal to the main question of parallelization, but it's a good idea anyway.
  • Make sure your machine is otherwise idle when doing performance testing.

Additional timings

I've merged the suggestions of Evgeny Panasyuk and ogni42 in a bare-bones C++03 Win32 implementation:

#include "stdafx.h"

#include <omp.h>
#include <vector>
#include <iostream>
#include <stdio.h>

using namespace std;

const int Width = 3264;
const int Height = 2540*8;

class Timer {
private:
    string name;
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    LARGE_INTEGER frequency;
public:
    Timer(const char *name) : name(name) {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start);
    }

    ~Timer() {
        QueryPerformanceCounter(&stop);
        LARGE_INTEGER time;
        time.QuadPart = stop.QuadPart - start.QuadPart;
        double elapsed = ((double)time.QuadPart /(double)frequency.QuadPart);
        printf("%-20s : %5.2f\n", name.c_str(), elapsed);
    }
};

static const int offsets[] = {2,1,1,0};

template <typename T>
void Inner_Orig(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = 0, bayerIndex = row * Width;
         col < Width; col++, bayerIndex++)
    {
        int offset = (row % 2)*2 + (col % 2); //0...3
        int rgbIndex = bayerIndex * 3 + offsets[offset];
        RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
    }
}

// adapted from ogni42's answer
template <typename T>
void Inner_Unrolled(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = row*Width, rgbCol =row*Width;
         col < row*Width+Width; rgbCol +=3, col+=4)
    {
        RgbChannel[rgbCol+0]  = BayerChannel[col+3];
        RgbChannel[rgbCol+1]  = BayerChannel[col+1];
        // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
        RgbChannel[rgbCol+2]  = BayerChannel[col+0];
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    for(int i = 0; i < 4; ++i)
    {
        {
            Timer t("serial_orig");
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_orig");
            #pragma omp parallel for
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_orig");
            #pragma omp parallel for schedule(static)
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }

        {
            Timer t("serial_unrolled");
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_unrolled");
            #pragma omp parallel for
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_unrolled");
            #pragma omp parallel for schedule(static)
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        printf("-----------------------------\n");
    }
    return 0;
}

Here are the timings I see on a triple-channel 8-way hyperthreaded Core i7-950 box:

serial_orig          :  0.13
omp_dynamic_orig     :  0.10
omp_static_orig      :  0.10
serial_unrolled      :  0.06
omp_dynamic_unrolled :  0.04
omp_static_unrolled  :  0.04

The "static" versions tell the compiler to evenly divide up the work between threads at loop entry. This avoids the overhead of attempting to do work stealing or other dynamic load balancing. For this code snippet, it doesn't seem to make a difference, even though the workload is very uniform across threads.

The performance reduction might be happening because you are trying to distribute the for loop across "row"-many tasks, more than the cores available, so it again becomes like a sequential execution, but with the added overhead of parallelism.

I'm not very familiar with parallel for loops, but it seems to me the contention is in the memory access. It appears your threads are overlapping access to the same pages.

Can you break up your array accesses into 4k chunks somewhat aligned with page boundaries?
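
A sketch of what that suggestion might look like (mine; it assumes 4 KiB pages):

#include <algorithm>
#include <cstddef>

// Sketch: enumerate page-sized index chunks of an array of elemSize-byte
// elements; each [begin, end) chunk can then be dispatched to one worker so
// no two threads write into the same page.
template<typename Body>
void ForEachPageChunk(std::size_t count, std::size_t elemSize, Body body)
{
    const std::size_t elemsPerPage = 4096 / elemSize; // assumed page size
    for (std::size_t begin = 0; begin < count; begin += elemsPerPage)
        body(begin, std::min(begin + elemsPerPage, count));
}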

There is no point in talking about parallel performance before having optimized the for loop for serial code. Here is my attempt at that (some good compilers may be able to obtain similarly optimized versions, but I'd rather not rely on that):

    parallel_for(0, Height, [=] (int row) noexcept
    {
        for (auto col=0, bayerindex=row*Width,
                  rgb0=3*bayerindex+offsets[(row%2)*2],
                  rgb1=3*bayerindex+offsets[(row%2)*2+1];
             col < Width; col+=2, bayerindex+=2, rgb0+=6, rgb1+=6 )
        {
            RgbChannel[rgb0] = BayerChannel[bayerindex  ];
            RgbChannel[rgb1] = BayerChannel[bayerindex+1];
        }
    }); 