Which errors to handle in a Clojure REST + disk I/O app?

https://stackoverflow.com/questions/14070255

12-12-2021

Question

I have a server application which, somewhat simplified, periodically takes measurements via a REST API from a server that is not beefy enough. The values should be cached locally (they are timestamped and immutable), maybe stored as a FloatBuffer where every position corresponds to a measurement sample. There's a web browser application which periodically makes Ajax requests to update some neat statistics on the webpage, like this picture:

[Figure: system architecture for the REST measurement service and presentation]

Assuming that the server is up and running, there are still many places where errors could occur:

The REST measurement server could be unreachable (where the server just keeps storing measurements locally)
The network connection to the measurement server could be down
The storage could be full or somehow corrupt
The browser could lose contact with the server and try to reconnect

My strategy for coping with errors in general should be the following:

If there are problems getting values from the measurement service via REST, there should be retries every minute. If the error persists for more than 30 consecutive minutes, the administrator should be notified. In case of disk problems the administrator should be notified at once, or preferably even before the disk fills up.
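The retry-and-escalate part of this policy can be sketched as a small state holder. This is only a minimal sketch in Java (chosen because the job already runs on a ScheduledThreadPoolExecutor); the class name `FailureTracker` and its methods are hypothetical, not part of any library. The job that runs every minute would call `recordSuccess()` or `recordFailure(now)` after each attempt and notify the administrator when `recordFailure` returns true.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: tracks a streak of failed fetches and decides when to escalate. */
final class FailureTracker {
    private final Duration escalateAfter;
    private Instant firstFailure;   // null while the most recent fetch succeeded
    private boolean notified;       // true once this streak has been escalated

    FailureTracker(Duration escalateAfter) {
        this.escalateAfter = escalateAfter;
    }

    /** Call after a successful fetch: ends the current failure streak. */
    synchronized void recordSuccess() {
        firstFailure = null;
        notified = false;
    }

    /**
     * Call after a failed fetch. Returns true exactly once per streak,
     * as soon as the streak has lasted at least escalateAfter.
     */
    synchronized boolean recordFailure(Instant now) {
        if (firstFailure == null) {
            firstFailure = now;
        }
        boolean overdue =
            Duration.between(firstFailure, now).compareTo(escalateAfter) >= 0;
        if (overdue && !notified) {
            notified = true;
            return true;
        }
        return false;
    }

    /** True while at least one fetch in a row has failed. */
    synchronized boolean isDegraded() {
        return firstFailure != null;
    }
}
```

Passing the current time in as an argument keeps the 30-minute rule testable without waiting; `isDegraded()` is what the web layer could consult to decide whether to show a warning next to the latest cached data.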

The end-user experience should hide the errors as far as possible, but the application should still behave sanely: it should notify the user that an error has occurred while still showing the latest data available.

How do I find which errors to cope with regarding network problems (using clj-http via an agent triggered by a ScheduledThreadPoolExecutor job to make the REST requests) and regarding disk problems when trying to flush the FloatBuffer?

What is a sane way to implement the quite stateful yet algorithmic strategy mentioned above? Should I simply handle the error when the agent reports it and switch to some kind of a recovery-mode job?

Solution

In an interaction like this, involving several components across different systems, the end user should not have to perform many synchronous operations. It's only the synchronous operations that are time-bound and require immediate error reporting.

Once the end user's interaction with the system is asynchronous, you have a lot of choice in the error-handling mechanism too. At the point where the end user interacts with the system, you can have an error mapper that translates the errors coming from the various components into messages the user can understand.
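As a rough illustration of such an error mapper, here is a minimal Java sketch. The class `ErrorMapper` and the message texts are made up for this example, and which exceptions actually reach this point depends on the HTTP client and storage code in use, so the list of types below is an assumption rather than a complete catalogue.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.file.FileSystemException;

/** Hypothetical mapper from low-level exceptions to user-facing messages. */
final class ErrorMapper {
    private ErrorMapper() {}

    static String toUserMessage(Throwable error) {
        // Most specific types first: all of these are subclasses of IOException.
        if (error instanceof UnknownHostException || error instanceof ConnectException) {
            return "The measurement service is unreachable. Showing the latest cached data.";
        }
        if (error instanceof SocketTimeoutException) {
            return "The measurement service is responding slowly. Showing the latest cached data.";
        }
        if (error instanceof FileSystemException) {
            return "Measurements could not be saved. The administrator has been notified.";
        }
        if (error instanceof IOException) {
            return "A communication problem occurred. Showing the latest cached data.";
        }
        // Anything else is unexpected; do not leak internal details to the user.
        return "An unexpected error occurred.";
    }
}
```

Keeping the translation in one place means the components can keep throwing whatever they throw, and only this mapper has to change when the wording shown to the user does.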

The user should be given an API to query the status of a submitted request. It should be able to tell whether the request is complete or whether there was an error. If the network connections are going to take more time, the status message can inform the user about that.
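One possible shape for such a status API, as a minimal Java sketch: an in-memory registry keyed by request id. The names `StatusRegistry`, `RequestStatus` and `RequestState` are hypothetical, and a real system would also need to expire old entries and expose `query` over HTTP.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

enum RequestState { PENDING, COMPLETE, FAILED }

/** Immutable snapshot of what the user is told about one request. */
final class RequestStatus {
    final RequestState state;
    final String message;

    RequestStatus(RequestState state, String message) {
        this.state = state;
        this.message = message;
    }
}

/** Hypothetical registry that background jobs update and the web layer queries. */
final class StatusRegistry {
    private final Map<String, RequestStatus> statuses = new ConcurrentHashMap<>();

    void submitted(String requestId) {
        statuses.put(requestId,
            new RequestStatus(RequestState.PENDING, "Request accepted, still working."));
    }

    void completed(String requestId) {
        statuses.put(requestId,
            new RequestStatus(RequestState.COMPLETE, "Request complete."));
    }

    void failed(String requestId, String userMessage) {
        statuses.put(requestId, new RequestStatus(RequestState.FAILED, userMessage));
    }

    /** Empty when the id was never submitted. */
    Optional<RequestStatus> query(String requestId) {
        return Optional.ofNullable(statuses.get(requestId));
    }
}
```

The browser can then poll `query` with the id it received on submission and show either the result, a "still working" note, or the mapped error message.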

In any distributed system, every component will report an error at some point. Some APIs provide error listener interfaces for this, which report errors to the user asynchronously. Have a look at APIs like JMS (http://docs.oracle.com/javaee/5/tutorial/doc/bnceh.html). They are proven in complex systems and have good error-handling mechanisms.
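In JMS this takes the form of an `ExceptionListener` registered on a connection with `setExceptionListener`. The same idea can be sketched without JMS in plain Java; the `ErrorListener` interface and `ErrorBus` class below are hypothetical names for illustration, not an existing API. In this sketch listeners are invoked on whichever thread reports the error, so a component running on a background thread delivers its errors from there.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical callback, loosely modelled on the listener style used by JMS. */
interface ErrorListener {
    void onError(String component, Throwable error);
}

/** Hypothetical hub: components report errors here, listeners react to them. */
final class ErrorBus {
    // CopyOnWriteArrayList allows listeners to be added while errors are being reported.
    private final List<ErrorListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(ErrorListener listener) {
        listeners.add(listener);
    }

    /** Delivers the error to every listener on the calling thread. */
    void report(String component, Throwable error) {
        for (ErrorListener listener : listeners) {
            try {
                listener.onError(component, error);
            } catch (RuntimeException e) {
                // A failing listener must not prevent the others from being notified.
            }
        }
    }
}
```

One listener could update the status shown to the user while another handles administrator notification, so the fetching and storage code only has to call `report`.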

Licensed under: CC-BY-SA with attribution