Question

I've been going through this tutorial on parallel pipelines. There is definitely a considerable difference in throughput, but couldn't it be even better if the compression stage also took on a read job, since it's just waiting around anyway? The same goes for the write stage: why not have it take on a third compression, then switch over to writing two results, and then have one of those cores go back to compressing while the other wraps up the third write, and so on?

I apologize if this is obvious. I imagine this is standard practice and has a name, I'm just not sure what it is. Is there any overhead involved with switching jobs like this?

And I know this might be the wrong forum for this last question, but can the GPU switch jobs like this or should the programmable shaders/CUDA cores pretty much be left alone after being programmed?

EDIT: I guess I also don't understand how taking the same six cores used in the 2 cores/stage example would be faster than just giving each of the six cores all three stages. Sure, there would be two cores that would do two, but that's still faster than the top scenario. I would understand it better in the GPU's case, since there is specialized hardware involved for certain computations, but generally speaking, I don't see it. Maybe this example is weak or something, because I know parallel processing is here to stay.

Parallel Pipeline


Solution

This is definitely an issue with pipelining and there are a number of different ways to try and mitigate it.

With specialized hardware, the design will often be tuned to balance the time taken in each stage for typical workloads. Fixed-function stages in GPUs, for example, are typically balanced around the needs of a sample of representative game rendering workloads, with transistors allocated so that each stage takes roughly the same time. Even with static balancing like this, however, some performance is usually still wasted.

An alternative approach that can be used in both software and hardware to balance a pipeline is to break the longer stages down into multiple shorter steps. This is a common strategy in CPU instruction pipelines but can also be useful in software. In your example, the longer-running compression step could potentially be broken down into multiple shorter pipeline stages. Depending on the task, however, this may be difficult or impossible to do efficiently.

Task scheduling systems can be used to help balance workloads across CPUs in a software pipeline. In a task scheduling system, you have a number of worker threads (usually around one per hardware thread) and any task can run on any worker thread. You have an API to set up dependencies between tasks and the task scheduler is responsible for scheduling tasks to run wherever CPU time is available once their dependencies are satisfied. In your example, the cores with idle time running the Read and Write tasks could help out with Compress tasks rather than sitting idle as long as the Compress tasks had their Read task dependencies satisfied.
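To make the idea concrete, here is a minimal sketch in Python, assuming a shared thread pool stands in for the task scheduler. The names read_block, compress_block, write_block and run_pipeline are hypothetical, and the "disk" is just an in-memory dict. It only illustrates the scheduling shape (any worker can run any task once its dependency has finished), not a tuned implementation.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def read_block(i):
    # Hypothetical stand-in for a disk read.
    return (f"block {i} " * 1000).encode()

def compress_block(data):
    return zlib.compress(data)

def write_block(i, packed, sink):
    # Hypothetical stand-in for a disk write: store the result in a dict.
    sink[i] = packed

def run_pipeline(num_blocks, num_workers=6):
    sink = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Read tasks have no dependencies, so they are all ready immediately.
        # Each in-flight future maps to (stage, block index).
        pending = {pool.submit(read_block, i): ("read", i)
                   for i in range(num_blocks)}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                stage, i = pending.pop(fut)
                result = fut.result()  # re-raises any exception from the task
                if stage == "read":
                    # Compress depends on Read, so it is scheduled only now.
                    pending[pool.submit(compress_block, result)] = ("compress", i)
                elif stage == "compress":
                    # Write depends on Compress.
                    pending[pool.submit(write_block, i, result, sink)] = ("write", i)
    return sink

if __name__ == "__main__":
    results = run_pipeline(8)
    print(len(results), "blocks written")
```

No worker is tied to a stage here: whichever thread is free picks up the next ready task, whether that is a Read, a Compress, or a Write.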

Traditional OS thread schedulers can give some of the same benefits as a task scheduling system. In your example, if the Read threads waited on a semaphore when their work queues were empty (to be signalled when new work was added to the queues), the OS could schedule Compress threads to run on those idle cores. This can work reasonably well for relatively long-running pipeline stages (tens of milliseconds), but for shorter pipeline stages (under 1 ms) the overhead of OS thread scheduling and the length of the thread time slice will likely mean a task scheduling system gives better performance.
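A rough sketch of that semaphore-based arrangement, again in Python and with hypothetical names (WorkQueue, stage, run, STOP), might look like the following. It assumes one thread per stage to keep shutdown simple, and uses a list in place of real output. A thread whose queue is empty blocks in acquire(), which leaves the OS free to run some other thread on that core.

```python
import threading
import zlib
from collections import deque

class WorkQueue:
    """Queue whose consumers sleep on a semaphore while it is empty."""
    def __init__(self):
        self._items = deque()
        self._available = threading.Semaphore(0)  # counts queued items

    def put(self, item):
        self._items.append(item)   # deque appends are thread-safe
        self._available.release()  # signal: wake one waiting consumer

    def get(self):
        self._available.acquire()  # blocks without spinning while empty
        return self._items.popleft()

STOP = object()  # sentinel telling a stage to shut down

def stage(inbox, outbox, work):
    while True:
        item = inbox.get()
        if item is STOP:
            if outbox is not None:
                outbox.put(STOP)   # pass the shutdown signal downstream
            return
        result = work(item)
        if outbox is not None:
            outbox.put(result)

def run(blocks):
    to_compress, to_write, written = WorkQueue(), WorkQueue(), []
    workers = [
        threading.Thread(target=stage, args=(to_compress, to_write, zlib.compress)),
        threading.Thread(target=stage, args=(to_write, None, written.append)),
    ]
    for t in workers:
        t.start()
    for block in blocks:           # the calling thread plays the Read stage
        to_compress.put(block)
    to_compress.put(STOP)
    for t in workers:
        t.join()
    return written
```

Every hand-off between stages goes through the OS (a release() that wakes a sleeping thread), which is where the per-item overhead mentioned above comes from.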

OTHER TIPS

Your points are valid. The tutorial is lacking.

If the read, compress, and write operations can all occur at once, independently, the simple non-pipelined case would be the fastest for the six cores. Also notice that in the six-core diagram the reads and writes never overlap, so the same cores could handle both. You only need four cores.
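A quick back-of-the-envelope check of the first claim, as a Python sketch. The stage times (1, 3 and 1 time units for read, compress and write) are made-up numbers, not taken from the tutorial, and the model assumes steady state with no contention between cores.

```python
def throughput_unpipelined(cores, stage_times):
    # Each core runs read, compress, write back to back on its own item,
    # so one core finishes one item every sum(stage_times) time units.
    return cores / sum(stage_times)

def throughput_pipelined(cores_per_stage, stage_times):
    # In steady state the slowest stage sets the rate for the whole pipeline.
    return min(n / t for n, t in zip(cores_per_stage, stage_times))

stage_times = (1.0, 3.0, 1.0)  # assumed read, compress, write times

unpipelined = throughput_unpipelined(6, stage_times)
pipelined = throughput_pipelined((2, 2, 2), stage_times)
print(unpipelined, pipelined)
```

With these assumed numbers the six independent cores finish 1.2 items per time unit, while the 2 cores/stage pipeline is held to about 0.67 by the compress stage.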

But consider a case where the reads all access the same disk, so issuing too many read operations in parallel makes each read take longer because they interfere with each other. In this case you can gain by pipelining the reads: the first compress steps start sooner, and compression is what limits the overall performance.

Licensed under: CC-BY-SA with attribution