Question

I need to construct a TPL Dataflow pipeline that will process a lot of messages. Because there are many messages, I cannot simply Post them into the infinite queue of a BufferBlock, or I will face memory issues. So I want to use the BoundedCapacity = 1 option to disable the queue, and MaxDegreeOfParallelism to process messages in parallel, since my TransformBlocks could take some time for each message. I also use PropagateCompletion to make both completion and failure propagate down the pipeline.

But I'm facing an issue with error handling when an error happens right after the first message: calling await SendAsync simply puts my app into infinite waiting.

I've simplified my case to a sample console app:

using System.Threading.Tasks.Dataflow;

var data_buffer = new BufferBlock<int>(new DataflowBlockOptions
{
    BoundedCapacity = 1
});

var process_block = new ActionBlock<int>(x =>
{
    throw new InvalidOperationException();
}, new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 2,
    BoundedCapacity = 1
});

data_buffer.LinkTo(process_block,
    new DataflowLinkOptions { PropagateCompletion = true });

for (var k = 1; k <= 5; k++)
{
    await data_buffer.SendAsync(k); // eventually hangs: the buffer fills up because process_block has faulted
    Console.WriteLine("Send: {0}", k);
}

data_buffer.Complete();

await process_block.Completion;

Solution

This is expected behavior. If there's a fault "downstream", the error does not propagate "backwards" up the mesh. The mesh is expecting you to detect that fault (e.g., via process_block.Completion) and resolve it.

If you want to propagate errors backwards, you could have an await or continuation on process_block.Completion that faults the upstream block(s) if the downstream block(s) fault.
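For example, a minimal sketch of that idea, using the names from the question (a faulted block declines postponed messages, so the pending SendAsync calls should complete with false instead of hanging):

_ = process_block.Completion.ContinueWith(t =>
{
    // If the downstream block faulted, fault the upstream block too,
    // releasing any SendAsync calls that are awaiting buffer space.
    ((IDataflowBlock)data_buffer).Fault(t.Exception.InnerException);
}, TaskContinuationOptions.OnlyOnFaulted);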

Note that this is not the only possible solution; you may want to rebuild that part of the mesh or link the sources to an alternative target. The source block(s) have not faulted, so they can just continue processing with a repaired mesh.
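A hedged sketch of the alternative-target idea (fallback_block is illustrative, not part of the question's code): keep the IDisposable returned by LinkTo, and when the target faults, unlink it and attach a replacement target so the source can keep draining:

IDisposable link = data_buffer.LinkTo(process_block,
    new DataflowLinkOptions { PropagateCompletion = true });

try { await process_block.Completion; }
catch
{
    link.Dispose(); // detach the faulted target from the still-healthy source

    var fallback_block = new ActionBlock<int>(
        x => Console.WriteLine("Fallback: {0}", x)); // illustrative replacement
    data_buffer.LinkTo(fallback_block,
        new DataflowLinkOptions { PropagateCompletion = true });
}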

OTHER TIPS

The LinkTo method with the PropagateCompletion configuration propagates the completion of the source block to the target block. So if the source block fails, the failure is propagated to the target block, and eventually both blocks complete. The same is not true if the target block fails: in that case the source block is not notified, and continues accepting and processing messages. If we add the BoundedCapacity configuration to the mix, the internal output buffer of the source block soon becomes full, preventing it from accepting more messages. And as you discovered, that can easily result in a deadlock.

To prevent a deadlock from happening, the simplest approach is to ensure that an error in any block of the pipeline causes the timely completion of all its constituent blocks. Other approaches are also possible, as indicated by Stephen Cleary's answer, but in the majority of cases I expect the fail-fast approach to be the desirable behavior. Surprisingly, this simple behavior is not so easy to achieve: no built-in mechanism is readily available for this purpose, and implementing it manually is tricky.

As of .NET 6, the only reliable way to forcefully complete a block that is part of a dataflow pipeline is to Fault the block, and also discard its output buffer by linking it to a NullTarget. Faulting the block alone, or canceling it through the CancellationToken option, is not enough: there are scenarios where a faulted or canceled block never completes. Here is a demonstration of the first case (faulted and not completed), and here is a demonstration of the second case (canceled and not completed). Both scenarios require that the block has previously been marked as completed, which can happen automatically and non-deterministically for all blocks that participate in a dataflow pipeline and are linked with the PropagateCompletion configuration. A GitHub issue reporting this problematic behavior exists: No way to cancel completing dataflow blocks. As of the time of this writing, no feedback has been provided by the devs.
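In isolation the recipe looks like this (a minimal sketch; ForceComplete is a hypothetical helper, not a library API):

static void ForceComplete<T>(ISourceBlock<T> block, Exception error)
{
    block.Fault(error); // Fault alone is not always enough...
    block.LinkTo(DataflowBlock.NullTarget<T>()); // ...so also discard the output buffer
}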

Armed with this knowledge, we can implement a LinkTo-on-steroids method that can create fail-fast pipelines like this:

using System.Threading.Tasks.Dataflow;

public static class DataflowExtensions
{
    /// <summary>
    /// Connects two blocks that belong in a simple, straightforward,
    /// one-way dataflow pipeline.
    /// Completion is propagated in both directions.
    /// Failure of the target block causes purging of all buffered messages
    /// in the source block, allowing the timely completion of both blocks.
    /// </summary>
    /// <remarks>
    /// This method should be used only if the two blocks participate in an exclusive
    /// producer-consumer relationship.
    /// The source block should be the only producer for the target block, and
    /// the target block should be the only consumer of the source block.
    /// </remarks>
    public static async void ConnectTo<TOutput>(this ISourceBlock<TOutput> source,
        ITargetBlock<TOutput> target)
    {
        source.LinkTo(target, new DataflowLinkOptions { PropagateCompletion = true });
        try { await target.Completion.ConfigureAwait(false); } catch { }
        if (!target.Completion.IsFaulted) return;
        if (source.Completion.IsCompleted) return;
        source.Fault(new Exception("Pipeline error."));
        source.LinkTo(DataflowBlock.NullTarget<TOutput>()); // Discard all output
    }
}

Usage example:

var data_buffer = new BufferBlock<int>(new() { BoundedCapacity = 1 });

var process_block = new ActionBlock<int>(
    x => throw new InvalidOperationException(),
    new() { BoundedCapacity = 2, MaxDegreeOfParallelism = 2 });

data_buffer.ConnectTo(process_block); // Instead of LinkTo

foreach (var k in Enumerable.Range(1, 5))
    if (!await data_buffer.SendAsync(k)) break; // false means the pipeline has failed

data_buffer.Complete();
await process_block.Completion;

Optionally you could also consider awaiting all the constituent blocks of the pipeline before awaiting the last one (or afterwards, in a finally block). This offers the advantage that, in case of failure, you won't risk leaking fire-and-forget operations that keep running in the background unobserved before the next reincarnation of the pipeline:

try { await Task.WhenAll(data_buffer.Completion, process_block.Completion); } catch { }

You can ignore all the errors that might be thrown by the await Task.WhenAll operation, because awaiting the last block conveys most of the error-related information anyway. You may only miss additional errors that happened in upstream blocks after the failure of a downstream block. You can try to observe all errors if you want, but it's tricky because of how the errors are propagated downstream: you may observe the same error multiple times. If you want to log diligently every single error, it is probably easier (and more accurate) to do the logging inside the lambdas of the processing blocks, instead of relying on their Completion property.
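For example (a sketch; Process is a hypothetical stand-in for the actual work, and Console.WriteLine for whatever logger you use):

var process_block = new ActionBlock<int>(x =>
{
    try
    {
        Process(x); // hypothetical processing method
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error processing {x}: {ex}"); // log every error at the source
        throw; // rethrow, so the block still faults and the pipeline shuts down
    }
}, new() { BoundedCapacity = 2, MaxDegreeOfParallelism = 2 });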

Shortcomings: The ConnectTo implementation above propagates the failure backwards one block at a time. The propagation is not instantaneous, because a faulted block does not complete until the processing of all currently in-flight messages has finished. This can be an issue in case the pipeline is long (5-6 blocks or more) and the workload of each block is chunky. This additional latency is not only a waste of time, but also a waste of resources, spent doing work that is going to be discarded anyway.

I've uploaded a more sophisticated version of the ConnectTo idea in this GitHub repository. It addresses the delayed-completion issue mentioned in the previous paragraph: a failure in any block is propagated instantaneously to all blocks. As a bonus it also propagates all the errors in the pipeline, as a flat AggregateException.


Note: This answer has been rewritten from scratch. The original answer (Revision 4) included some wrong ideas, and a flawed implementation of the ConnectTo method.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow