Question

I have a WCF service that processes a feed of tens of thousands of records from SAP. The service call takes an XElement as its main parameter and processes the XML to update records in our database. The current intent is to call the WCF service asynchronously and have it send the same document back to the caller, with a status for each record processed.

I'm also looking into ways to multithread the processing of the data, though this may not end up buying me anything.

Because this could take a while, I'm concerned about what will happen if the WCF service dies, gets restarted, etc. I need to know which records I've processed, and which I haven't, and be able to complete processing on the remaining records.

The best I've been able to come up with is to update each node with a status (I have to do this, anyway, to send back to the caller), and save this file to the hard drive. But saving a file that large potentially 100,000 times doesn't really seem feasible.

What other strategies could I use to track these records as I process them?

TIA!
James

Solution

You could put the records from your XML into your database first, perhaps in a special "records to be processed" table. Each row might also be tagged with some way to correlate it with a specific request. Process the rows from the database, and as you process each one, update its status field (corresponding to the node status that you would have updated on the XElement). When you are finished, you could either go back and update the XML (if you haven't crashed in the meantime) or generate new XML (which could be problematic if you can't round-trip the conversion XML -> database -> XML).

If the service dies, it should be relatively simple to examine the database to find the records that have not been processed and finish processing them.
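
Something like the following is a minimal sketch of that idea. The SapRecordStaging table, its columns, the connection string, and the <Record Id="..."> node shape are all assumptions for illustration, not anything from the question:

```csharp
// Sketch of a hypothetical SapRecordStaging(RequestId, RecordKey, RecordXml, Status) table.
using System;
using System.Data.SqlClient;
using System.Xml.Linq;

class StagingLoader
{
    const string ConnectionString =
        "Data Source=.;Initial Catalog=Feeds;Integrated Security=True"; // assumed

    // Insert every <Record> node from the incoming document as a 'Pending' row,
    // correlated with this request, so a crash never loses track of what arrived.
    public static void Stage(Guid requestId, XElement feed)
    {
        using (var conn = new SqlConnection(ConnectionString))
        {
            conn.Open();
            foreach (var record in feed.Elements("Record"))
            {
                using (var cmd = new SqlCommand(
                    "INSERT INTO SapRecordStaging (RequestId, RecordKey, RecordXml, Status) " +
                    "VALUES (@req, @key, @xml, 'Pending')", conn))
                {
                    cmd.Parameters.AddWithValue("@req", requestId);
                    cmd.Parameters.AddWithValue("@key", (string)record.Attribute("Id"));
                    cmd.Parameters.AddWithValue("@xml", record.ToString());
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }

    // Flip the status of a single record the moment it has been handled.
    public static void SetStatus(Guid requestId, string recordKey, string status)
    {
        using (var conn = new SqlConnection(ConnectionString))
        using (var cmd = new SqlCommand(
            "UPDATE SapRecordStaging SET Status = @status " +
            "WHERE RequestId = @req AND RecordKey = @key", conn))
        {
            cmd.Parameters.AddWithValue("@status", status);
            cmd.Parameters.AddWithValue("@req", requestId);
            cmd.Parameters.AddWithValue("@key", recordKey);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```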

Alternatively, you could write the XML file to disk once, keep a table in the database that holds ONLY the "status" field (plus one or more keys that let you find the corresponding record in the XML file again), process the records, and update the database "status" table as you go. When finished, update the status fields in the XML file in one fell swoop by reading the statuses from the "status" table.

Again, if the service dies, it should be simple enough to examine the "status" table to see which rows have been processed and which have not.
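
A minimal sketch of this second variant, under the same assumed table and node shape as above: statuses live only in the small table while the big XML file sits on disk untouched, then get folded back into the file in a single pass at the end.

```csharp
// Fold the statuses from the database back into the XML file in one pass.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Xml.Linq;

class StatusMerger
{
    const string ConnectionString =
        "Data Source=.;Initial Catalog=Feeds;Integrated Security=True"; // assumed

    public static void ApplyStatuses(Guid requestId, string xmlPath)
    {
        // Read every key -> status pair recorded for this request.
        var statuses = new Dictionary<string, string>();
        using (var conn = new SqlConnection(ConnectionString))
        using (var cmd = new SqlCommand(
            "SELECT RecordKey, Status FROM SapRecordStaging WHERE RequestId = @req", conn))
        {
            cmd.Parameters.AddWithValue("@req", requestId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    statuses[reader.GetString(0)] = reader.GetString(1);
            }
        }

        // Rewrite the file on disk once, instead of once per record.
        var doc = XDocument.Load(xmlPath);
        foreach (var record in doc.Root.Elements("Record"))
        {
            string key = (string)record.Attribute("Id");
            string status;
            if (key != null && statuses.TryGetValue(key, out status))
                record.SetAttributeValue("Status", status);
        }
        doc.Save(xmlPath);
    }
}
```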

Good luck!

OTHER TIPS

I see MSMQ as a great way to fulfill most of the needs you outlined above: break the nodes into messages and enter them into a transactional queue (there is a small sketch after this list).

  • Scaling the processing of the data would be easier: once you maxed out the capabilities of one machine, you could have more machines processing from the queue.
  • If WCF "dies, gets restarted, etc" you don't lose anything.
  • The real problem you will have with this scenario is getting the client to figure out how far the service has gotten in the processing. The queue messages are one-way only, so you would probably need another service call that the client can use to check the status of the queue processing.
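
Here is a minimal sketch of the enqueue side, assuming a local private transactional queue (the queue path and the <Record> node shape are illustrative assumptions):

```csharp
// Break the feed into one MSMQ message per record, all inside one transaction.
using System.Messaging;
using System.Xml.Linq;

class FeedEnqueuer
{
    const string QueuePath = @".\private$\SapFeedQueue"; // assumed queue path

    public static void Enqueue(XElement feed)
    {
        // Create the queue as transactional if it does not exist yet.
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath, true /* transactional */);

        using (var queue = new MessageQueue(QueuePath))
        using (var tx = new MessageQueueTransaction())
        {
            tx.Begin();
            foreach (var record in feed.Elements("Record"))
            {
                // One small message per record: a crashed worker only loses the
                // message it was handling, and a transactional receive puts even
                // that one back on the queue.
                queue.Send(new Message(record.ToString()), tx);
            }
            tx.Commit();
        }
    }
}
```

A worker would likewise receive inside a MessageQueueTransaction (or a TransactionScope, if the receive and the database update need to commit together), so a record only leaves the queue once it has actually been handled.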

Links to MSMQ WCF how-to's:

http://msdn.microsoft.com/en-us/library/ms789048.aspx

http://code.msdn.microsoft.com/msmqpluswcf

If your source and destination databases are SQL Server, then you should forget about middle-men and go straight to the built-in queuing support in the database: Service Broker. You get a number of advantages over MSMQ:

  • High availability. Service Broker is built into the database, so the database high-availability and disaster-recovery solution you already have implemented will automatically pick up your messaging solution too. Your cluster or database-mirroring solution will work out of the box, and the messaging will fail over transparently with the database failover.
  • Recovery consistency. Having your messages and your data in the same recovery unit (the database) allows for a simple backup/restore. With messages stored in MSMQ and data stored in the database, it is not possible to take a consistent backup unless you freeze processing.
  • Routing. SSB allows queues to move to new physical locations without interrupting the message stream. See Service Broker Routing.
  • Increased capacity. MSMQ has a very small size limit (4GB per queue), which can be quickly overrun in production, with disastrous results. The SSB limit is 2GB per message, and the queue size limits are the database size limits.
  • Significantly higher throughput, due to local transactions instead of distributed transactions. With MSMQ you must enroll the database and MSMQ in a distributed transaction, both at the end where you enqueue and at the end where you dequeue; this dramatically reduces throughput in the MSMQ case.

There are other advantages too.

The one thing you lose is the WCF service programming model. WCF does make it extremely easy to write demo apps, and you'll lose that.
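
To give a feel for what the Service Broker route looks like from C#, here is a hedged sketch of sending one record. The service, contract, and message-type names are made up for illustration and would have to be created beforehand (CREATE MESSAGE TYPE / CONTRACT / QUEUE / SERVICE); the point to notice is that the SEND shares one ordinary local SQL transaction with any data updates, with no MSDTC involved:

```csharp
// Enqueue one record through Service Broker using a plain local transaction.
using System.Data;
using System.Data.SqlClient;

class BrokerSender
{
    const string ConnectionString =
        "Data Source=.;Initial Catalog=Feeds;Integrated Security=True"; // assumed

    public static void SendRecord(string recordXml)
    {
        // Illustrative service/contract/message type names; they must already exist.
        const string sql = @"
            DECLARE @h UNIQUEIDENTIFIER;
            BEGIN DIALOG CONVERSATION @h
                FROM SERVICE [//SapFeed/Initiator]
                TO SERVICE '//SapFeed/Processor'
                ON CONTRACT [//SapFeed/RecordContract]
                WITH ENCRYPTION = OFF;
            SEND ON CONVERSATION @h MESSAGE TYPE [//SapFeed/Record] (@body);";

        using (var conn = new SqlConnection(ConnectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())   // local transaction only
            using (var cmd = new SqlCommand(sql, conn, tx))
            {
                cmd.Parameters.Add("@body", SqlDbType.NVarChar, -1).Value = recordXml;
                cmd.ExecuteNonQuery();
                tx.Commit();
            }
        }
    }
}
```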

Have you considered a messaging server, such as Microsoft Message Queuing?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow