Question

I am trying to speed up my image resizing algorithm further by combining IPP and TBB. There are two ways I can do this:

  1. Use IPP without TBB
  2. Use IPP with TBB inside a parallel_for loop

I have coded the application and I get the correct result. Surprisingly, though, the computation time is larger when the two are combined. To avoid clutter, I paste only part of my code here, but I can provide the whole code if needed. (The base of the algorithm was borrowed from the Intel TBB sample code for image resizing.) For the first case, when I use only IPP, the code looks like this:

ippiResizeSqrPixel_8u_C1R(src, srcSize, srcStep, srcRoi, dst, dstStep, dstRoi,
    m_nzoom_x, m_nzoom_y, 0, 0, interpolation, pBufferWhole);

and the parallel_for loop looks like this:

parallel_for(
    blocked_range<size_t>(0,CHUNK),
    [=](const blocked_range<size_t> &r){
        for (size_t i= r.begin(); i!= r.end(); i++){
            ippiResizeSqrPixel_8u_C1R(src+((int)(i*srcWidth*srcHeight)), srcSize,
                srcStep, srcRoi, dst+((int)(i*dstWidth*dstHeight)), dstStep, dstRoi,
                m_nzoom_x, m_nzoom_y, 0, 0, interpolation, pBuffer);
        }
    }
);

src and dst are pointers to the source image and the destination image. When TBB is used, the image is partitioned into CHUNK parts, and the parallel_for loops over all the chunks and uses an IPP function to resize each chunk independently. The values of dstHeight, srcHeight, srcRoi, and dstRoi are modified to accommodate the partitioning of the image, and src+((int)(i*srcWidth*srcHeight)) and dst+((int)(i*dstWidth*dstHeight)) point to the beginning of each partition in the source and destination image.

Apparently, IPP and TBB can be combined in this manner, since I get the correct result, but what baffles me is that the computation time gets worse when they are combined compared to when IPP is used alone. Any thoughts on what could be the cause, or how I could solve this issue?

Thanks!


Solution 2

It turns out that some IPP functions are multi-threaded internally. For such functions, no improvement can be gained by wrapping them in TBB. Apparently ippiResizeSqrPixel_8u_C1R( ... ) is one of those functions. When I disabled all the cores but one, both versions performed equally well.

OTHER TIPS

In your code, each parallel task in parallel_for makes multiple ippiResizeSqrPixel calls. Compared with the serial version, which makes only one call, this may add pointless overhead: such a function may have a preparation phase (for example, setting up an interpolation coefficient table), and it is generally designed to process a large memory block at a time for runtime efficiency. (I don't know how IPP actually implements this, though.)

I suggest the following parallel structure:

parallel_for(
  // Range = src (or dst) height of image.
  blocked_range<size_t>(0, height),
  [=](const blocked_range<size_t> &r) {
    // 'r' = vertical range of image to process in this task.
    // You can calculate src/dst region from 'r' here,
    // and call ippiResizeSqrPixel once per task.
    ippiResizeSqrPixel_8u_C1R( ... );
  }
);
Licensed under: CC-BY-SA with attribution