Question

When doing (possibly heavy) pixel processing on a large image, multithreading becomes a must. The standard practice is to run a loop whose indices are partitioned across multiple threads in a thread pool. Once the appropriate thread-safety measures are in place to ensure correct results, the performance benefits become immediately apparent.

However, there are multiple ways to partition the indices. The most common methods are partitioning by row or by pixel. Here is my interpretation of the advantages and drawbacks of each:

By Row:

  • Less thread creation overhead

  • Thread load may be uneven when the number of rows is not divisible by the number of threads; a wide but short image can therefore be processed inefficiently across multiple cores

By Pixel:

  • More thread creation overhead

  • Thread load can be distributed more evenly, because the leftover work when the pixel count is not divisible by the number of threads is relatively small

Is my interpretation correct, or is there more to the story? Should I always choose one over the other?

For reference, I am using the Parallel.For() function in C#.
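For concreteness, here is a minimal C# sketch of the two partitioning schemes described above; Process is a hypothetical stand-in for whatever per-pixel work is actually being done:

```csharp
using System;
using System.Threading.Tasks;

class PartitioningDemo
{
    // Stand-in for the real per-pixel work (here: invert an 8-bit value).
    static byte Process(byte v) => (byte)(255 - v);

    // By row: one loop iteration per scanline, so each unit of work is a full row.
    static void ByRow(byte[] pixels, int width, int height) =>
        Parallel.For(0, height, y =>
        {
            int row = y * width;
            for (int x = 0; x < width; x++)
                pixels[row + x] = Process(pixels[row + x]);
        });

    // By pixel: one loop iteration per pixel, so there is far more
    // scheduling overhead per unit of work.
    static void ByPixel(byte[] pixels, int width, int height) =>
        Parallel.For(0, width * height, i => pixels[i] = Process(pixels[i]));

    static void Main()
    {
        var img = new byte[1920 * 1080];
        ByRow(img, 1920, 1080);
        ByPixel(img, 1920, 1080);
        Console.WriteLine("done");
    }
}
```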


Solution

I use an approach where each task gets a left/top/right/bottom (LTRB) rectangle along with the pixels, representing a rectangular region of the image to process.

That allows me to take those cases where an image is much wider than it is tall and still split it up into rectangular chunks with, say, 1024 pixels to process per task. For small images with fewer than 1024 pixels total, I don't even bother with a parallel loop, since I've found it's generally cheaper to just use a single-threaded for loop in those cases.
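A minimal sketch of that idea in C# (since the question uses Parallel.For); the Rect type, the Process stand-in, and the exact tile-size choice are all illustrative, not the answerer's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class TileDemo
{
    const int TargetPixelsPerTask = 1024; // chunk size suggested above

    // Hypothetical LTRB rectangle; Right and Bottom are exclusive.
    readonly record struct Rect(int Left, int Top, int Right, int Bottom);

    // Stand-in for the real per-pixel work.
    static byte Process(byte v) => (byte)(255 - v);

    static void ProcessImage(byte[] pixels, int width, int height)
    {
        // Small image: a plain loop beats the cost of scheduling parallel tasks.
        if (width * height <= TargetPixelsPerTask)
        {
            for (int i = 0; i < pixels.Length; i++)
                pixels[i] = Process(pixels[i]);
            return;
        }

        // Cut the image into roughly square tiles of about the target area.
        int tile = (int)Math.Sqrt(TargetPixelsPerTask); // 32 x 32 = 1024 pixels
        var rects = new List<Rect>();
        for (int y = 0; y < height; y += tile)
            for (int x = 0; x < width; x += tile)
                rects.Add(new Rect(x, y, Math.Min(x + tile, width), Math.Min(y + tile, height)));

        // Each task owns a disjoint rectangle, so no synchronization is needed.
        Parallel.ForEach(rects, r =>
        {
            for (int y = r.Top; y < r.Bottom; y++)
                for (int x = r.Left; x < r.Right; x++)
                    pixels[y * width + x] = Process(pixels[y * width + x]);
        });
    }

    static void Main()
    {
        // A wide-but-short image still splits into many well-sized tiles.
        ProcessImage(new byte[8192 * 16], 8192, 16);
    }
}
```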

Typically you won't get good performance trying to assign one pixel per task. At least with libraries like OpenMP and TBB, you need a sufficient amount of work in each task, or else the overhead of scheduling the tasks will outweigh the benefits of multithreading, to the point where you can easily end up slower than single-threaded code.
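The same principle applies to Parallel.For in C#: range partitioning via Partitioner.Create hands each task a contiguous block of indices rather than a single one. A sketch under that assumption, again with a stand-in Process function:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

class RangePartitionDemo
{
    // Stand-in for the real per-pixel work.
    static byte Process(byte v) => (byte)(255 - v);

    static void ProcessImage(byte[] pixels)
    {
        // Hand each task a contiguous block of ~1024 indices instead of one pixel,
        // so the scheduling cost is amortized over a meaningful amount of work.
        Parallel.ForEach(Partitioner.Create(0, pixels.Length, 1024), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                pixels[i] = Process(pixels[i]);
        });
    }

    static void Main() => ProcessImage(new byte[1920 * 1080]);
}
```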

Also, unless your image algorithms don't care about the positions of the pixels they're processing, per-pixel tasks carry another overhead: the pixel coordinates must be passed along with each individual pixel.

So I recommend processing in rectangular chunks as I do; having each thread process a scanline isn't bad either, and will generally be good enough for the common cases.

OTHER TIPS

You should consider moving this work to the GPU. It was designed for massively parallel processing, and many image processing tasks fall into this category. It also often has thousands of cores instead of just a few dozen like modern CPUs. The overhead of uploading to the card and downloading the results back to main memory is often dwarfed by the speed of processing, unless the processing is extremely simple. GPU cores will often operate on a fragment at a time, which is usually something like a 2x2 pixel area. In my experience it's easy to sustain 30 fps for HD (1920 x 1080) video footage, and hitting 60 isn't too difficult. In many cases, real-time processing of 4K footage is possible too.

For an example of an extremely efficient use of the GPU for image processing, I recommend looking at Apple's CoreImage, and in particular its CIFilter class. Even if you're not working on macOS, the ideas can be applied on other systems. Filters are written as small kernels expressed as fragment shaders, and the framework can concatenate these shaders to reduce the number of intermediate buffers involved.
