Question

I'm going to be designing a simple data analysis tool which processes different kinds of data through a directed graph. The directed graph is somewhat customizable by the user. Each node will perform logging, analysis, and mathematical operations on the data passing through it. The graph is similar in many ways to a neural network, except with additional processing at each node. Some nodes perform simple operations on the data elements passing through, while other nodes run complex algorithms.

How do I multithread the processing in this directed graph such that I can get the result out of the graph in the fastest and most efficient way? Memory is not an issue here, and neither is the time it takes to initialize this task.

I've thought of a couple different methods to multithread the work:

  • Each thread 'follows' a data element entering the start node of the graph. The thread stays with that data element as it passes through each node, calling the processing method at every node all the way through the graph. This essentially requires one thread per data element entering the system; once the element has been carried through the entire graph, the thread is recycled. The problem here is when a node has two outgoing edges: the thread would need to follow both (does this mean pulling a new thread from a thread pool?).

  • Create a thread per node and a data buffer on each graph edge. The worker thread on each node continually checks its input buffers, which hold data whenever an upstream node takes longer to process. The problem with this approach is the inherent 'polling' of the buffers to see whether there is enough data to start processing; perhaps that is a small price to pay for simplifying the data flow for any graph configuration.

Can anyone think of a better way, or which one do you recommend? I'm looking for the least latency through the system and the ability to constantly process a stream of incoming data.

Thanks! Brett


Solution

First of all, it is not a good idea to spawn an unlimited number of threads (e.g. one thread per node). You usually want at most 1.5-3 times as many threads as you have CPU cores (e.g. 6-12 threads on a quad-core).

I would recommend using a thread pool and tasks. Your problem then becomes: how big should each task be?
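As a minimal sketch in Python (assuming the standard library's `concurrent.futures`; the 1.5x factor is just the rule of thumb above), a fixed-size pool can be sized from the core count so the thread count never grows with the workload:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Size the pool from the core count; the 1.5x factor is a rule of thumb.
cores = os.cpu_count() or 4          # os.cpu_count() can return None
pool = ThreadPoolExecutor(max_workers=max(1, int(cores * 1.5)))

# Tasks queue up behind the fixed worker set; threads never multiply.
squares = list(pool.map(lambda x: x * x, range(8)))
pool.shutdown()
# squares == [0, 1, 4, 9, 16, 25, 36, 49]
```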

Both of the methods you mentioned are valid and each has its own pros and cons.

One task per data input should be easy to implement, as the graph-processing algorithm itself stays single-threaded. The overhead of context switching, synchronization, and passing data between threads is minimal.

When a node has two outgoing edges, the single task has to follow both of them. This is a standard part of any graph-traversal algorithm, e.g. depth-first search or breadth-first search.
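A sketch of this option in Python (the `Node` class, `op` callables, and the example graph are all illustrative, not from the question): each submitted task carries one data element depth-first through the whole graph, following both edges of a branch itself.

```python
from concurrent.futures import ThreadPoolExecutor

class Node:
    """Hypothetical node: applies an operation, then forwards to successors."""
    def __init__(self, op, successors=None):
        self.op = op
        self.successors = successors or []

def follow(node, value, results):
    """One task carries a data element depth-first through the whole graph."""
    value = node.op(value)
    if not node.successors:
        results.append(value)        # sink node: record the final result
    for succ in node.successors:     # two outgoing edges -> follow both
        follow(succ, value, results)

# Example graph: start doubles its input, then feeds two sinks.
sink_a = Node(lambda x: x + 1)
sink_b = Node(lambda x: x * x)
start = Node(lambda x: x * 2, [sink_a, sink_b])

results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    for element in [1, 2, 3]:        # one task per incoming data element
        pool.submit(follow, start, element, results)
# The with-block waits for all tasks; results then holds one value
# per sink per element (in nondeterministic order).
```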

One task per graph node can improve latency if your graphs have many "branches" that can be processed in parallel. However, this approach requires a more complex graph-processing design and carries a higher thread-synchronization overhead. The cost of the multi-threading machinery might actually exceed the benefit gained from processing the graph in parallel.

When a node has two outgoing edges, you can create two new tasks and queue them on the thread pool (or queue one task and continue processing the other one yourself).
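The "queue one branch, continue with the other" variant might look like this (again with an illustrative `Node` shape; note that blocking on child futures from inside a worker can deadlock a fixed-size pool on deep graphs, so a production version would use continuations or completion callbacks instead):

```python
from concurrent.futures import ThreadPoolExecutor

class Node:
    def __init__(self, op, successors=None):
        self.op = op
        self.successors = successors or []

def process(pool, node, value):
    value = node.op(value)
    if not node.successors:
        return [value]               # sink: this branch's final result
    *rest, last = node.successors
    # Queue every extra branch as a new task on the pool...
    futures = [pool.submit(process, pool, succ, value) for succ in rest]
    # ...and keep processing one branch on the current thread.
    out = process(pool, last, value)
    for f in futures:                # gather the queued branches
        out.extend(f.result())
    return out

branch_a = Node(lambda x: x + 1)
branch_b = Node(lambda x: x * 10)
start = Node(lambda x: x, [branch_a, branch_b])

with ThreadPoolExecutor(max_workers=4) as pool:
    out = process(pool, start, 5)    # results from both branches
```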

The more difficult problem is when a node has two incoming edges: the task processing that node has to wait until the data for both edges is available.
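One common pattern for such join nodes (sketched here with illustrative names) is to track pending inputs under a lock; whichever task delivers the last input runs the node, so no task ever blocks waiting:

```python
import threading

class JoinNode:
    """Waits for one value per incoming edge, then combines them."""
    def __init__(self, n_inputs, combine):
        self.n_inputs = n_inputs
        self.combine = combine
        self.inputs = {}
        self.lock = threading.Lock()

    def deliver(self, edge, value):
        """Called by upstream tasks. Returns the combined result once every
        input has arrived, or None while the node is still waiting."""
        with self.lock:
            self.inputs[edge] = value
            if len(self.inputs) < self.n_inputs:
                return None                   # still waiting on another edge
            return self.combine(self.inputs)  # last arrival triggers the node

join = JoinNode(2, lambda ins: ins["left"] + ins["right"])
first = join.deliver("left", 2)    # None: "right" hasn't arrived yet
second = join.deliver("right", 3)  # 5: both inputs present, node fires
```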

Conclusion: I would personally start with the first option (one task per data input) and see how far you can get with it.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow