Question

This is my first post here, so apologies if this isn't structured well.

We have been tasked to design a tool that will:

  • Read a file (of account IDs), CSV format
  • Download the account data file from the web for each account (by Id) (REST API)
  • Pass the file to a converter that will produce a report (financial predictions etc) [~20ms]
  • If the prediction threshold is within limits, run a parser to analyse the data [400ms]
  • Generate a report for the analysis above [80ms]
  • Upload all files generated to the web (REST API)

Now all those individual points are relatively easy to do. I'm interested in finding out how best to architect something to handle this and to do it fast & efficiently on our hardware.

We have to process roughly 2 million accounts. The figures in square brackets give an idea of how long each step takes on average. I'd like to use the maximum resources available on the machine - 24 core Xeon processors. It's not a memory-intensive process.

Would using TPL and creating each of these as a task be a good idea? Each has to happen sequentially but many can be done at once. Unfortunately the parsers are not multi-threading aware and we don't have the source (it's essentially a black box for us).

My thoughts were something like this, assuming we're using TPL (rough code sketch follows the list):

  • Load account data (essentially a CSV import or SQL SELECT)
  • For each Account (Id):
    • Download the data file for each account
    • ContinueWith using the data file, send to the converter
    • ContinueWith check threshold, send to parser
    • ContinueWith Generate Report
    • ContinueWith Upload outputs
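
Very roughly, as a sketch (the step methods below are just placeholders for our own wrappers, not code we actually have yet):

using System.Linq;
using System.Threading.Tasks;

class PipelineSketch
{
    // Placeholder steps - each would wrap the real REST call, converter, parser etc.
    static string Download(int id)              { return "datafile-" + id; }        // REST download
    static string RunConverter(string dataFile) { return dataFile + ".converted"; } // converter (~20ms)
    static bool   WithinThreshold(string conv)  { return true; }                    // threshold check
    static string ParseAndReport(string conv)   { return conv + ".report"; }        // parser (400ms) + report (80ms)
    static void   Upload(string report)         { }                                 // REST upload

    static void Main()
    {
        var accountIds = Enumerable.Range(1, 100); // stand-in for the CSV import

        var tasks = accountIds.Select(id =>
            Task.Run(() => Download(id))
                .ContinueWith(t => RunConverter(t.Result))
                .ContinueWith(t => WithinThreshold(t.Result) ? ParseAndReport(t.Result) : null)
                .ContinueWith(t => { if (t.Result != null) Upload(t.Result); }))
            .ToArray();

        Task.WaitAll(tasks);
    }
}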

Does that sound feasible or am I not understanding it correctly? Would it be better to break down the steps a different way?

I'm a bit unsure how to handle the parser throwing exceptions (it's very picky) or failures when uploading.

All this is going to be in a scheduled job that will run after-hours as a console application.


Solution

I would think about using some kind of message bus. That way you can separate the steps, and if one of them fails (for example because the REST service isn't accessible for some time) you can store the message and process it later on.

Depending on what you use as a message bus, you can also introduce threading with it.

In my opinion you can design workflows and handle exceptional states much better if you have a higher-level abstraction like a service bus.

Also, because the parts run independently, they don't block each other.

One easy way could be to use ServiceStack messaging with the Redis MQ server.

Some advantages quoted from there:

  • Message-based design allows for easier parallelization and introspection of computations

  • DLQ messages can be introspected, fixed and later replayed after server updates and rejoin normal message workflow
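
A rough sketch of how that could look (the message DTOs like DownloadAccount are made up for illustration, and the exact namespace that RedisMqServer lives in depends on the ServiceStack version you install):

using ServiceStack.Messaging;   // IMessage<T>, IMessageQueueClient
using ServiceStack.Redis;       // PooledRedisClientManager
// RedisMqServer comes from the ServiceStack.Redis / ServiceStack.Server packages;
// its namespace differs between ServiceStack versions.

// Hypothetical message DTOs - one per pipeline step you want to decouple.
public class DownloadAccount { public int AccountId { get; set; } }
public class ConvertFile     { public string DataFile { get; set; } }

public class Program
{
    public static void Main()
    {
        var redisFactory = new PooledRedisClientManager("localhost:6379");
        var mqServer = new RedisMqServer(redisFactory, retryCount: 2);

        // Each handler does one step; returning the next DTO publishes it as a new
        // message for that type's handler. Messages that keep failing end up in the
        // DLQ, where they can be inspected and replayed later.
        mqServer.RegisterHandler<DownloadAccount>(m =>
        {
            var dataFile = "datafile-" + m.GetBody().AccountId; // your REST download here
            return new ConvertFile { DataFile = dataFile };
        });
        mqServer.RegisterHandler<ConvertFile>(m =>
        {
            // run the converter, threshold check, parser, report, upload ...
            return null;
        });

        mqServer.Start();

        // Seed the queues from the CSV of account ids
        using (var mqClient = mqServer.CreateMessageQueueClient())
        {
            mqClient.Publish(new DownloadAccount { AccountId = 12345 });
        }
    }
}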

OTHER TIPS

I think the easy way to start with multiple threads in your case is to put the entire operation for each account id on a thread (or better, on the ThreadPool). Done the way proposed below, I don't think you will need to coordinate anything between threads.

Something like this to put the data on the thread pool queue:

var accountIds = new List<int>(); // populated from the CSV import / SQL SELECT
foreach (var accountId in accountIds)
{
    ThreadPool.QueueUserWorkItem(ProcessAccount, accountId);
}

And this is the function you will process each account:

public static void ProcessAccount(object accountId)
{
    // Download the data file for this account
    // Pass the data file to the converter
    // If the prediction threshold is within limits, send the output to the parser
    // Generate the report for the analysis
    // Upload the outputs
}
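
Since you mentioned the parser is picky, it's probably worth wrapping the body in a try/catch so one bad account doesn't take down the whole run. Roughly like this (DownloadDataFile, RunConverter and the other helpers are placeholders for your own wrappers, shown here only as stubs so the sketch compiles):

public static void ProcessAccount(object state)
{
    var accountId = (int)state;
    try
    {
        var dataFile  = DownloadDataFile(accountId);      // REST download
        var converted = RunConverter(dataFile);           // ~20ms
        if (!WithinThreshold(converted)) return;          // outside limits: skip this account

        var analysis = RunParser(converted);              // 400ms, the picky black box
        var report   = GenerateReport(analysis);          // 80ms
        Upload(report);                                   // REST upload
    }
    catch (Exception ex)
    {
        // Record the failure so the account can be retried in a later run
        Console.Error.WriteLine("Account {0} failed: {1}", accountId, ex.Message);
    }
}

// Placeholder stubs - replace with your real implementations
static string DownloadDataFile(int id)        { return "datafile-" + id; }
static string RunConverter(string dataFile)   { return dataFile + ".converted"; }
static bool   WithinThreshold(string report)  { return true; }
static string RunParser(string report)        { return report + ".parsed"; }
static string GenerateReport(string analysis) { return analysis + ".report"; }
static void   Upload(string report)           { }
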
Licensed under: CC-BY-SA with attribution