Question

Overview

At company abc, teams are split between mission-critical applications and their extended applications. For example, customer xyz works with mission-critical data, but has business constraints/flows that cannot be satisfied by the mission-critical system alone. To preserve data integrity (the mission-critical system is the system of record), web services have been built to interact with the mission-critical data. Each of these web services sends a single entity (business object) with respect to the mission-critical system.

On the mission-critical side, there is a dependency on some third-party tools. One of these tools queues the web service calls into threads, and it has the ability to reject currently queued or new web service calls; rejection means the data did not make it into the destination system. Although I did not get clarification on all the criteria, the usual concerns apply, such as large data sets, network traffic, or database locks.

Customer Request

As stated above, the extended application is supposed to provide functionality more in line with the customer's expected workflows. A request has been made to modify extended application data in bulk; for example, vendor x was bought out by vendor w, so all currently occurring transactions must be remapped to the new vendor. In this case there can be 1 to n records that need to be changed, and each record may not exist together as one entity in the mission-critical system.

The last step in this process would be to upload the changed data to the mission-critical system, because external systems (accounting, vendors, etc.) are required to use data only from that system.

Notes

There is the possibility of getting a custom web service, but there would still be an issue if the third-party tool rejects the call after the web service call was deemed successful.

Processes do exist where bulk data is exported to files (CSV or spreadsheet), but their use has been focused on data going outside the company. If a web service could accept a file, the rejection issue still remains. If some batch/ETL job were used, there would be a data-integrity risk, since that would create a separate branch for manipulating data outside the context of the mission-critical system.

An idea has been tossed around to queue the data entries in our own staging/transaction area, so that we could still upload through the web service and keep a record if something was not fully processed. The concern with this idea is network traffic. (Example: 1,000 entries are updated and the data spans 50 entities; the web service would have to make 50 calls in its current format.)

Questions

For the design:

Where does the bulk data manipulation fit in respect to the customer request?

Should the mission-critical system define a way to accept bulk data?

Should this only be the responsibility of the team associated with the extended application? (EX: Define a "reasonable" process to the customer in respect to how data integrity is maintained; Or defining staging/transaction areas with some kind of job that only passes the data via web service calls)

With the current company structure (many teams to handle the subsets), what could be added to an individual team's process to help with future functionality requests at a similar level as above? (requests that cross into mission-critical data integrity)


Solution

You should use a third-party library, or build your own system, that immediately writes to a database before attempting anything further. Should something go wrong, you still have that information saved somewhere for recovery. Ideally the system would retry the send should it fail, again recording the status in the database before performing any action.
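A minimal sketch of this write-ahead pattern, using SQLite as a stand-in for the staging database. The `outbox` table, `submit` function, and `send_to_web_service` stub are all hypothetical names, not part of the systems described above:

```python
import sqlite3

def send_to_web_service(payload):
    """Stand-in for the real call; a real implementation would raise on rejection."""
    return True

def submit(conn, payload, max_retries=3):
    # Record the request BEFORE attempting to send it, so nothing is lost on failure.
    cur = conn.execute(
        "INSERT INTO outbox (payload, status) VALUES (?, 'PENDING')", (payload,))
    msg_id = cur.lastrowid
    conn.commit()
    for attempt in range(max_retries):
        try:
            send_to_web_service(payload)
            conn.execute("UPDATE outbox SET status = 'SENT' WHERE id = ?", (msg_id,))
            conn.commit()
            return msg_id
        except Exception:
            # Track each failed attempt before trying again.
            conn.execute("UPDATE outbox SET status = 'RETRYING' WHERE id = ?", (msg_id,))
            conn.commit()
    conn.execute("UPDATE outbox SET status = 'FAILED' WHERE id = ?", (msg_id,))
    conn.commit()
    return msg_id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, status TEXT)")
msg_id = submit(conn, '{"vendor": "w"}')
status = conn.execute("SELECT status FROM outbox WHERE id = ?", (msg_id,)).fetchone()[0]
```

The key point is the ordering: the INSERT commits before the first send attempt, so a crash at any later step leaves a row behind that a recovery job can pick up.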

If you need to, you can link messages with a common identifier so that, should any single message fail, you can reverse (or at least attempt to reverse) all messages sharing that identifier, in the opposite order from which they were executed, starting at the point of error and working backwards. Just remember to always record state before attempting any rollbacks!
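The reverse-order rollback can be sketched like this. The `execute` and `undo` callbacks are placeholders for the real send and compensating calls:

```python
def process_batch(messages, execute, undo):
    """Execute messages in order; on failure, undo the completed ones in reverse."""
    completed = []
    for msg in messages:
        try:
            execute(msg)
            completed.append(msg)
        except Exception:
            # Roll back in the opposite order of execution, from the point of error.
            for done in reversed(completed):
                undo(done)
            return False
    return True

# Toy run: message 3 fails, so 2 is undone before 1.
log = []

def execute(m):
    if m == 3:
        raise RuntimeError("simulated failure")
    log.append(("exec", m))

def undo(m):
    log.append(("undo", m))

ok = process_batch([1, 2, 3], execute, undo)
```

In a real system, `completed` would live in the database (the state recorded before rollback), not in memory, so the rollback can resume after a crash.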

Of course, you could then face the case where you cannot write to the database, but the database should be local to the server and therefore unlikely to suffer networking problems. There is still a chance it fails just the same, and if it continues to fail after several retries, there should be at least a last-ditch effort to notify a systems administrator by e-mail that something is seriously wrong. This would be a rare situation indeed, since both the send and the database write would have to fail before it occurs.
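The escalation path can be kept small. Here the `notify_admin` callback stands in for the real e-mail step (e.g. something built on `smtplib`), so the retry-then-escalate logic can be shown and tested on its own:

```python
def persist_with_fallback(write_db, notify_admin, max_retries=3):
    """Try the local write several times; escalate to a human as a last resort."""
    for attempt in range(max_retries):
        try:
            write_db()
            return True
        except OSError:
            continue
    notify_admin(f"staging database unreachable after {max_retries} attempts")
    return False

# Toy run: the write always fails, so the admin alert fires.
alerts = []

def failing_write():
    raise OSError("disk unavailable")

ok = persist_with_fallback(failing_write, alerts.append)
```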

Where does the bulk data manipulation fit in respect to the customer request? Should the mission-critical system define a way to accept bulk data?

Don't make every update a single message! Remember that you only need to save enough information to recreate the request. Save the CSV/spreadsheet file on the server with a unique name, and in the message indicate the path on the server to get to the file. You both reduce clutter in your message queue and remove the complication of housing all those columns and their values in the database. The part of your program responsible for retrying messages should also be able to process the CSV/spreadsheet in such cases and parse/send it just as would have happened had the request gone well in the first place.
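A sketch of that idea, assuming nothing about the real message format: the staged file gets a unique name, the queued message carries only the path, and the retry processor re-reads the file to recover the rows. All names here (`stage_bulk_request`, `replay`, the `"bulk_update"` type) are illustrative:

```python
import csv
import json
import os
import tempfile
import uuid

def stage_bulk_request(rows, staging_dir):
    """Write the bulk data to a uniquely named CSV; return a small message
    that references the file instead of embedding every row."""
    path = os.path.join(staging_dir, f"{uuid.uuid4().hex}.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return json.dumps({"type": "bulk_update", "file": path})

def replay(message):
    """Retry processor: re-read the staged file to get the rows to send."""
    path = json.loads(message)["file"]
    with open(path, newline="") as f:
        return list(csv.reader(f))

staging = tempfile.mkdtemp()
msg = stage_bulk_request([["vendor_x", "vendor_w"], ["txn1", "vendor_w"]], staging)
rows = replay(msg)
```

The queued message stays a few bytes long regardless of how many records the bulk change touches.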

Should this only be the responsibility of the team associated with the extended application? (EX: Define a "reasonable" process to the customer in respect to how data integrity is maintained; Or defining staging/transaction areas with some kind of job that only passes the data via web service calls)

The team responsible for the web service receiving the data has every right to refuse it should the data not be valid. However, I would strongly advise you to come up with a standard error system that defines the common errors which may occur (while allowing for errors unique to a particular web service).

The system I defined at my workplace used code 0 for "everything went well", codes 100-199 for warnings, codes 200-299 for common errors, and anything from 1000 up for custom error codes. That way you know, for instance, to retry if there was an internal error on their side, but not to retry if the data is considered invalid.
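A small sketch of how a sender can act on such a scheme. Only the ranges come from the description above; the specific code assignments (200 for an internal fault, 201 for invalid data) are invented for illustration:

```python
# Hypothetical assignments within the ranges described above.
SUCCESS = 0
WARN_PARTIAL = 100        # warning: processed, but with caveats
ERR_INTERNAL = 200        # common error: transient fault on the receiving side
ERR_INVALID_DATA = 201    # common error: the payload itself is bad

RETRYABLE = {ERR_INTERNAL}

def should_retry(code):
    """Retry only transient faults; never retry invalid data or success."""
    return code in RETRYABLE

def is_success(code):
    """Warnings count as delivered: the data made it into the system."""
    return code == SUCCESS or 100 <= code <= 199
```

The point is that the retry decision becomes a table lookup rather than ad-hoc string matching on error messages.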

I would also avoid giving guarantees to the customer. Instead, tell them the policy you follow in case of error, including retries and when attempts are made, and finally how they will be notified should something not go as expected.

The bottom line is that as long as you write to the database before any operation, you can retrace the exact point of failure in mission-critical systems; and although you can't guarantee success, you can at least take action when something fails.

Licensed under: CC-BY-SA with attribution