Is there a race condition in the `latch` sample in N3600?

https://stackoverflow.com/questions/16241463

13-04-2022
|

Question

Proposed for inclusion in C++14 (aka C++1y) are some new thread synchronization primitives: latches and barriers. The proposal is

It sounds like a good idea and the samples make it look very programmer-friendly. Unfortunately, I think the sample code invokes undefined behavior. The proposal says of latch::~latch():

Destroys the latch. If the latch is destroyed while other threads are in wait(), or are invoking count_down(), the behaviour is undefined.

Note that it says "in wait()" and not "blocked in wait()", as the description of count_down() uses.

Then the following sample is provided:

An example of the second use case is shown below. We need to load data and then process it using a number of threads. Loading the data is I/O bound, whereas starting threads and creating data structures is CPU bound. By running these in parallel, throughput can be increased.
void DoWork()
{
    latch start_latch(1);
    vector<thread*> workers;
    for (int i = 0; i < NTHREADS; ++i) {
      workers.push_back(new thread([&] {
        // Initialize data structures. This is CPU bound.
        ...
        start_latch.wait();
        // perform work
        ...
      }));
    }
    // Load input data. This is I/O bound.
    ...
    // Threads can now start processing
    start_latch.count_down();
}

Isn't there a race condition between the threads waking and returning from wait(), and destruction of the latch when it leaves scope? Besides that, all the thread objects are leaked. If the scheduler doesn't run all worker threads before count_down returns and the start_latch object leaves scope, then I think undefined behavior will result. Presumably the fix is to iterate the vector and join() and delete all the worker threads after count_down but before returning.

Is there a problem with the sample code?
Do you agree that a proposal should show a complete correct example, even if the task is extremely simple, in order for reviewers to see what the use experience will be like?

Note: It appears possible that one or more of the worker threads haven't yet begun to wait, and will therefore call wait() on a destroyed latch.

Update: There's now a new version of the proposal, but the representative example is unchanged.

Solution

Thanks for pointing this out. Yes, I think that the sample code (which, in its defense, was intended to be concise) is broken. It should probably wait for the threads to finish.

Any implementation that allows threads to be blocked in wait() is almost certainly going to involves some kind of condition variable, and destroying the latch while a thread has not yet exited wait() is potentially undefined.

I don't know if there's time to update the paper, but I can make sure that the next version is fixed.

Alasdair

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow