Question

I want to get the community's perspective on this. If I have a process which is heavily DB/IO bound, how smart would it be to parallelize individual process paths using the Task Parallel Library?

I'll use an example ... if I have a bunch of items, I need to do the following operations:

  1. Query a DB for a list of items
  2. Do some aggregation operations to group certain items based on a dynamic list of parameters.
  3. For each grouped result, Query the database for something based on the aggregated result.
  4. For each grouped result, Do some numeric calculations (3 and 4 would happen sequentially).
  5. Do some inserts and updates for the results calculated in #4
  6. Do some inserts and updates for each item returned in #1

Logically speaking, I can parallelize into a graph of tasks at steps #3, #5, and #6, since one item has no bearing on the result of another. However, each of these will be waiting on the database (SQL Server), which is fine; I understand that we can only process as fast as the SQL Server will let us.

But I want to distribute the work on the local machine so that it processes as fast as the database lets us, without having to wait for anything on our end. I've built a mock prototype in which I substitute the DB calls with Thread.Sleep (I also tried some variations with .SpinWait, which was a million times faster), and the parallel version is far faster than the current implementation, which is completely serial and not parallel at all.
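A minimal sketch of what such a prototype might look like — the `Thread.Sleep` stands in for the DB round trip, the `item * 2` for the per-group work, and the item counts and `maxParallel` value are arbitrary placeholders; capping `MaxDegreeOfParallelism` is one way to keep from flooding the real database later:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class PrototypeDemo
{
    // Fan the per-item work out across a capped number of workers.
    public static List<int> Run(IEnumerable<int> items, int maxParallel)
    {
        var results = new ConcurrentBag<int>();
        Parallel.ForEach(
            items,
            // Cap concurrency so we don't flood the (simulated) database.
            new ParallelOptions { MaxDegreeOfParallelism = maxParallel },
            item =>
            {
                Thread.Sleep(50);       // stand-in for a DB round trip
                results.Add(item * 2);  // stand-in for the per-item work
            });
        return results.OrderBy(x => x).ToList();
    }

    static void Main()
    {
        // 16 simulated 50 ms DB calls: ~800 ms serially, ~200 ms with 4 workers.
        var results = Run(Enumerable.Range(0, 16), maxParallel: 4);
        Console.WriteLine(results.Count); // prints 16
    }
}
```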

What I'm afraid of is putting too much strain on the SQL Server ... are there any considerations I should keep in mind before I go too far down this path?


Solution

Another option would be to create a pipeline, so that step 3 for the second group happens at the same time as step 4 for the first group. And if you can overlap the updates at step 5, do that too. That way you're doing concurrent SQL access and processing, but not over-taxing the database, because you only have two concurrent operations going on at once.

So you do steps 1 and 2 sequentially (I presume) to get a collection of groups that require further processing. Then your main thread starts:

for each group
  query the database
  place the results of the query into the calc queue

A second thread services the results queue:

while not end of data
  Dequeue result from calc queue
  Do numeric calculations
  place the results of the calculation into the update queue

A third thread services the update queue:

while not end of data
  Dequeue result from update queue
  Update database

The System.Collections.Concurrent.BlockingCollection<T> is a very effective queue for this kind of thing.

The nice thing here is that you can scale it if you want, adding multiple calculation threads or query/update threads, if the SQL Server can handle more concurrent transactions.
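Scaling one stage that way might look like this minimal sketch — the `n * n` work, the item count, and the worker count are all placeholders; the one subtlety is that the downstream queue must only be completed after every worker has finished:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

static class ScaledPipeline
{
    public static int Run(int itemCount, int calcThreads)
    {
        var calcQueue   = new BlockingCollection<int>();
        var updateQueue = new BlockingCollection<int>();

        foreach (var n in Enumerable.Range(1, itemCount)) calcQueue.Add(n);
        calcQueue.CompleteAdding();

        // N identical calculation workers share one queue;
        // GetConsumingEnumerable hands each item to exactly one of them.
        var workers = Enumerable.Range(0, calcThreads)
            .Select(_ => Task.Run(() =>
            {
                foreach (var n in calcQueue.GetConsumingEnumerable())
                    updateQueue.Add(n * n);  // stand-in for the calculation
            }))
            .ToArray();

        // Complete the downstream queue only after *every* worker is done;
        // otherwise the update stage would see "end of data" too early.
        Task.WaitAll(workers);
        updateQueue.CompleteAdding();

        return updateQueue.Count;
    }

    static void Main() => Console.WriteLine(Run(20, 3)); // prints 20
}
```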

I use something very similar to this in a daily merge/update program, with very good results. That particular process uses standard file I/O rather than SQL Server, but the concepts translate very well.

OTHER TIPS

If the parallel version is much faster than the serial version, I would not worry about the strain on your SQL server...unless of course the tasks you are performing are low priority compared to some other significant or time-critical operations that are also performed on the DB server.

I don't fully understand your description of the tasks, but it almost sounds like more of them should be performed directly in the database (I presume there are details that make that not possible?).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow