Question

I am reading about semi-explicit parallelism in Haskell and am a bit confused.

      par :: a -> b -> b

People say that this approach would let us parallelise automatically by evaluating every sub-expression of a Haskell program in parallel. But this approach has the following disadvantages:

1) It creates far too many small items of work, which cannot be scheduled efficiently. As I understand it, if you use the par function on every line of a Haskell program, it will create too many threads, which is not practical at all. Is that right?

2) With this approach, parallelism is limited by data dependencies in the source program. If I understand correctly, that means every sub-expression must be independent; for instance, in the par function, a and b must be independent.

3) The Haskell runtime system does not necessarily create a thread to compute the value of the expression a. Instead, it creates a spark, which has the potential to be executed on a different thread from the parent thread.

So, my question is: will the runtime system eventually create a thread to compute a or not? Or will it create a new thread to compute a only if the expression a is needed to compute the expression b, and not otherwise? Is this true?

I am a newbie to Haskell, so my questions may be quite basic for all of you. Thanks for your answers.


Solution

The par combinator you mention is part of Glasgow parallel Haskell (GpH), which implements semi-explicit parallelism. Semi-explicit means it is not fully implicit, and hence it does not provide automatic parallelisation: the programmer still needs to identify sub-expressions deemed worthwhile to execute in parallel, precisely to avoid the issue you mention in 1).

Moreover, the annotation is not prescriptive (as e.g. pthread_create in C or forkIO in Haskell) but advisory: the runtime system ultimately decides whether or not to evaluate the sub-expressions in parallel. This provides additional flexibility and a way to control granularity dynamically. Additionally, so-called Evaluation Strategies have been designed to abstract over par and pseq and to separate the specification of coordination from that of computation. For instance, the strategy parListChunk lets you split a list into chunks and evaluate those chunks in parallel, forcing the elements as far as the given strategy demands (this is a case where some strictness is needed).
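
As a concrete illustration, here is a minimal sketch using Control.Parallel.Strategies from the parallel package; the function expensive and the chunk size 100 are made up for the example:

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

    -- Stand-in for some real per-element work.
    expensive :: Int -> Integer
    expensive n = sum [1 .. fromIntegral n]

    main :: IO ()
    main = print (sum results)
      where
        -- Evaluate the list in chunks of 100 elements; each chunk is
        -- sparked and fully forced (rdeepseq) on whichever capability
        -- picks it up.
        results = map expensive [1 .. 10000] `using` parListChunk 100 rdeepseq

Compile with -threaded and run with +RTS -N to actually use multiple cores; the chunk size is the knob for granularity control.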

2) Parallelism is limited by data dependencies in the sense that the computation defines how the graph is reduced and which value is demanded at which point. It is not true that every sub-expression must be independent. For instance, E1 `par` E2 returns the result of E2, so for the annotation to be useful, some part of E1 needs to be used in E2, and hence E2 depends on E1.
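
The classic parallel Fibonacci sketch illustrates exactly this: the sparked sub-expression feeds into the final result, and that dependency is what makes the spark useful (nfib is just the conventional example name):

    import Control.Parallel (par, pseq)

    nfib :: Int -> Int
    nfib n
      | n < 2     = 1
      | otherwise = x `par` (y `pseq` x + y)
      where
        x = nfib (n - 1)  -- sparked: may be evaluated in parallel
        y = nfib (n - 2)  -- forced here first (pseq), then x is demanded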

3) The picture is slightly confusing here because of GHC-specific terminology. There are Capabilities (or Haskell Execution Contexts), which implement parallel graph reduction and each maintain a spark pool and a thread pool. Usually there is one Capability per core (they can be thought of as OS threads). At the other end of the continuum there are sparks, which are basically pointers to parts of the graph that have not been evaluated yet (thunks). And there are threads (actually a sort of tasks or work units): to be evaluated in parallel, a spark needs to be turned into a thread, which has a so-called thread state object containing the necessary execution environment and allows a thunk to be evaluated in parallel. A thread evaluating a thunk may depend on the results of other thunks and will block until those results arrive. These threads are much more lightweight than OS threads and are multiplexed onto the available Capabilities.

So, in summary, the runtime will not even necessarily create a lightweight thread to evaluate a sub-expression. By the way, random work-stealing is used for load balancing.
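
You can observe this in practice: GHC's runtime statistics report what became of the sparks. A sketch of the workflow (the file name and the -N value are arbitrary):

    $ ghc -O2 -threaded Nfib.hs
    $ ./Nfib +RTS -N4 -s

The SPARKS line in the -s output shows, among other things, how many sparks were converted (actually turned into a thread and run) versus fizzled (the parent evaluated the thunk first, so the spark was discarded).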

This is a very high-level approach to parallelism and avoids race conditions and deadlocks by design. Synchronisation is implicitly mediated through graph reduction. A nice position statement that discusses this further is Why Parallel Functional Programming Matters. For more information on the abstract machine behind the scenes, have a look at the Spineless Tagless G-machine (STG) and check out the notes on the Haskell Execution Model on the GHC wiki (usually the most up-to-date documentation alongside the source code).

OTHER TIPS

  1. Yes, you are correct. You would not gain anything by creating a spark for every expression you want computed; you would get way, way too many sparks. Trying to manage this is what Data Parallel Haskell (DPH) is about. DPH is a way of breaking nested computations down into well-sized chunks that can then be computed in parallel. Keep in mind that it is still a research effort and probably not ready for mainstream consumption.

  2. Once again, you are correct. If b depends on a, you have to compute as much of a as b needs before the computation of b can proceed.

  3. Yup. Threads actually have a pretty high overhead compared to some of the alternatives. Sparks are somewhat like thunks, except that they can be evaluated independently, potentially in parallel.

No, the RTS will not create a new thread to compute a. You decide how many threads the RTS should have running (+RTS -N6 for six), and those are kept alive for the duration of the program.

par only creates a spark. A spark is not a thread. Sparks sit in a work pool, and the scheduler performs work stealing: when a thread goes idle, it picks up a spark from the pool and computes it.
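
The newer Eval-monad idiom from Control.Parallel.Strategies makes this distinction explicit (a minimal sketch; the workloads are arbitrary):

    import Control.Parallel.Strategies (runEval, rpar, rseq)

    pair :: (Integer, Integer)
    pair = runEval $ do
      a <- rpar (sum [1 .. 5000000])  -- only creates a spark; an idle
                                      -- capability may steal and run it
      b <- rseq (sum [1 .. 3000000])  -- evaluated here, on this thread
      return (a, b)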

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow