How to treat unhandled exceptions? (Terminate the application vs. Keep it alive)

https://softwareengineering.stackexchange.com/questions/399424

03-03-2021

Question

What is the best practice when an unhandled exception occurs in a desktop application?

I was thinking about showing a message to the user so that they can contact support. I would recommend that the user restart the application, but not force it. This is similar to what is discussed here: ux.stackexchange.com - What's the best way to handle unexpected application errors?

The project is a .NET WPF application, so the described proposal could look like this (note that this is a simplified example; it would probably make sense to hide the exception details until the user clicks "Show Details" and to provide some functionality for easily reporting the error):

using System;
using System.Windows;
using System.Windows.Threading;

public partial class App : Application
{
    public App()
    {
        DispatcherUnhandledException += OnDispatcherUnhandledException;
    }

    private void OnDispatcherUnhandledException(object sender, DispatcherUnhandledExceptionEventArgs e)
    {
        LogError(e.Exception);
        MessageBoxResult result = MessageBox.Show(
            "Please help us fix it and contact support@example.com. " +
            $"Exception details: {e.Exception}\n\n" +
            "We recommend restarting the application. " +
            "Do you want to stop the application now? (Warning: unsaved data will be lost.)",
            "Unexpected error occurred", MessageBoxButton.YesNo);

        // Setting 'Handled' to 'true' will prevent the application from terminating.
        e.Handled = result == MessageBoxResult.No;
    }

    private void LogError(Exception ex)
    {
        // Log to a log file...
    }
}

In the implementation (commands of ViewModels or event handlers for external events) I would then catch only the specific exogenous exceptions and let all other exceptions (boneheaded and unknown exceptions) bubble up to the "last resort handler" described above. For definitions of boneheaded and exogenous exceptions, see: Eric Lippert - Vexing exceptions
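To make that split concrete, here is a minimal sketch of what such a command body could look like. The class name MeasurementExport and the method TrySave are hypothetical and only serve as illustration; the point is that only the exogenous I/O failures are caught, and nothing else is.

```csharp
using System;
using System.IO;

public static class MeasurementExport
{
    // Hypothetical command body: writes results to a user-chosen path.
    // Returns false and a user-facing message if an exogenous failure occurred.
    public static bool TrySave(string path, string contents, out string error)
    {
        try
        {
            File.WriteAllText(path, contents);
            error = "";
            return true;
        }
        catch (IOException ex)
        {
            // Exogenous: disk full, file locked, directory missing.
            error = "Could not save results: " + ex.Message;
            return false;
        }
        catch (UnauthorizedAccessException ex)
        {
            // Exogenous: no permission to write to this location.
            error = "Could not save results: " + ex.Message;
            return false;
        }
        // Deliberately no catch (Exception): an ArgumentNullException or
        // NullReferenceException points to a bug and is left to bubble up
        // to the last resort handler.
    }
}
```

A missing directory is reported to the user through the error message, while passing a null path still throws and reaches the global handler.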

Does it make sense to let the user decide whether the application should be terminated? If the application is terminated, you can be sure that no inconsistent state is left. On the other hand, the user may lose unsaved data, or may not be able to stop an external process that was already started until the application has been restarted.

Or does the decision whether to terminate the application on unhandled exceptions depend on the type of application you are writing? Is it just a trade-off between "robustness" and "correctness", as described in Code Complete, Second Edition?

To give you some context about the kind of application we are talking about: the application is mainly used to control chemical lab instruments and show the measured results to the user. To do so, the WPF application communicates with some services (local and remote). The WPF application does not communicate directly with the instruments.

Solution

You have to expect your program to terminate for more reasons than just an unhandled exception anyway, such as a power failure or a different background process that crashes the whole system. Therefore I would recommend terminating and restarting the application, but with some measures to mitigate the consequences of such a restart and minimize the possible data loss.

Start by analysing the following points:

How much data can actually get lost in case of a program termination?
How severe is such a loss really for the user? Can the lost data be reconstructed in less than 5 minutes, or are we talking about losing a day's work?
How much effort is it to implement some "intermediate backup" strategy? Don't rule this out because "the user would have to enter a change reason" on a regular save operation, as you wrote in a comment. Better think of something like a temporary file or state, which may be reloaded automatically after a program crash. Many types of productivity software do this (for example, MS Office and LibreOffice both have an "autosave" feature and crash recovery).
In case data was wrong or corrupted, can the user see this easily (maybe after a restart of the program)? If yes, you may offer an option to let the user save the data (with some small chance that it is corrupted), then force a restart, reload it and let the user check whether the data looks fine. Make sure not to overwrite the last version that was saved regularly (write to a temporary location/file instead) to avoid corrupting the old version.

Whether such an "intermediate backup" strategy is a sensible option depends ultimately on the application and its architecture, and on the nature and structure of the data involved. But if the user will lose less than 10 minutes of work, and such a crash happens once a week or even less often, I would probably not invest too much thought into this.
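As a rough illustration of the "intermediate backup" idea, the following sketch writes the working state to a separate recovery file and offers it back at the next startup. The class name RecoveryFile and its members are made up for this example, and it assumes the state can be serialized to a string; it is a sketch, not a complete autosave feature.

```csharp
using System;
using System.IO;

public sealed class RecoveryFile
{
    private readonly string _path;

    // 'path' is a hypothetical location for the recovery snapshot, kept
    // separate from the file the user saves to explicitly.
    public RecoveryFile(string path)
    {
        _path = path;
    }

    // Call periodically (for example from a timer) with the serialized state.
    public void WriteSnapshot(string serializedState)
    {
        string temp = _path + ".tmp";
        File.WriteAllText(temp, serializedState);

        // Swap the finished file into place, so that a crash in the middle of
        // writing should not leave a half-written snapshot under the real name.
        if (File.Exists(_path))
            File.Replace(temp, _path, null);
        else
            File.Move(temp, _path);
    }

    // Call at startup: returns true if a snapshot from a previous run exists.
    public bool TryRecover(out string serializedState)
    {
        if (File.Exists(_path))
        {
            serializedState = File.ReadAllText(_path);
            return true;
        }
        serializedState = "";
        return false;
    }

    // Call after a regular save or a clean shutdown.
    public void Clear()
    {
        File.Delete(_path);
    }
}
```

Because the snapshot never touches the regularly saved file, a corrupt snapshot can simply be discarded after the user has looked at it.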

OTHER TIPS

It depends to some extent on the application you're developing but in general, I'd say that if your application encounters an unhandled exception, you need to terminate it.

Why?

Because you can no longer have any confidence in the state of the application.

Definitely, provide a helpful message to the user, but you should ultimately terminate the application.

Given your context, I would definitely want the application to terminate. You do not want software running in a lab to produce corrupt output, and since you didn't think to handle the exception, you have no idea why it was thrown or what is happening.

Considering that this is meant for a chemical lab and that your application does not control the instruments directly but rather through other services:

Force termination after showing the message. After an unhandled exception your application is in an unknown state. It could send erroneous commands. It can even invoke nasal demons. An erroneous command could potentially waste expensive reagents or bring danger to equipment or people.

But you can do something else: gracefully recover after restarting. I assume that your application doesn't bring down those background services with itself when it crashes. In that case you can easily recover the state from them. Or, if you have more state, consider saving it in a storage which has provisions for data atomicity and integrity (SQLite maybe?).

Edit:

As stated in the comments, the process you control may require changes fast enough that the user won't have time to react. In that case you should consider silently restarting the app in addition to graceful state recovery.
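A silent restart could be sketched roughly as follows. The class CrashRestart and the flag --recover-after-crash are assumptions made for this example; the new instance would have to check for the flag itself and run its state recovery.

```csharp
using System;
using System.Diagnostics;

public static class CrashRestart
{
    // Hypothetical command-line flag that tells the new instance to recover state.
    public const string RecoveryFlag = "--recover-after-crash";

    // Builds the start info for a fresh instance of the given executable.
    public static ProcessStartInfo BuildRestartInfo(string executablePath)
    {
        return new ProcessStartInfo
        {
            FileName = executablePath,
            Arguments = RecoveryFlag,
            UseShellExecute = false
        };
    }

    // Intended to be called from the last resort handler, after logging.
    public static void RestartAndTerminate(Exception ex)
    {
        string exe = Process.GetCurrentProcess().MainModule?.FileName
            ?? throw new InvalidOperationException("Cannot determine the executable path.");

        Process.Start(BuildRestartInfo(exe));

        // FailFast ends the process immediately, without running finally
        // blocks or finalizers in the possibly corrupt application.
        Environment.FailFast("Unhandled exception, restarting.", ex);
    }
}
```

If the exception also occurs during startup, this restarts in a loop, so a real implementation would probably want to limit the number of automatic restarts.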

Trying to generally answer this question at the top level of the program is not a smart play.

If something has bubbled up all the way, and at no point in the architecture of the application did anyone consider this case, you have no generalizations you can make about what actions are, or are not, safe to take.

So, no, it is definitely not a generally acceptable design to allow the user to choose whether or not the application attempts to recover, because the application and the developers demonstrably have not done the due diligence necessary to find out if that's possible or even wise.

However, if the application has high-value portions of its logic or behavior that have been engineered with this sort of failure recovery in mind, and it is possible to leverage them in this case, then by all means, do so. In that case, it may be acceptable to prompt the user to see if they want to attempt recovery, or if they would like to just call it quits and start over.

This sort of recovery is not generally necessary or advisable for all (or even most) programs, but, if you are working on a program for which this degree of operational integrity is required, that might be a circumstance in which presenting this sort of a prompt to a user would be a sane thing to do.

In lieu of any special failure recovery logic: no, don't do this. You literally have no idea what will happen; if you did, you'd have caught the exception further down and handled it.

The problem with "exceptional exceptions", i.e. exceptions that you haven't foreseen, is that you don't know which state the program is in. For example, trying to save the user's data could actually destroy even more data.

For that reason, you should terminate the application.

There is a very interesting idea called Crash-only Software by George Candea and Armando Fox. The idea is that if you design your software in such a way that the only way to close it is to crash it and the only way to start it is to recover from a crash, then your software will be more resilient, and the error recovery code paths will be much more thoroughly tested and exercised.

They came up with this idea after noticing that some systems started faster after a crash than after an orderly shutdown.

A good, although no longer relevant, example is some older versions of Firefox that not only started faster when recovering from a crash, but also had a better startup experience that way! In those versions, if you shut down Firefox normally, it would close all open tabs and start up with a single empty tab, whereas when recovering from a crash, it would restore the tabs that were open at the time of the crash. (And that was the only way to close Firefox without losing your current browsing context.) So, what did people do? They simply never closed Firefox and instead always killed it with pkill -KILL firefox.

There is a nice writeup about crash-only software by Valerie Aurora on Linux Weekly News. The comments are also worth a read. For example, someone in the comments rightfully points out that those ideas are not new, and are in fact more or less equivalent to the design principles of Erlang/OTP based applications. And, of course, looking at this today, another 10 years after Valerie's and 15 years after the original article, we might notice that the current microservice hype is re-inventing those same ideas yet again. Modern cloud-scale data center design is also an example at a coarser granularity. (Any computer can crash at any time without affecting the system.)

It is, however, not enough to just let your software crash. It has to be designed for it. Ideally, your software would be broken up into small, independent components that each can crash independently. Also, the "crash mechanism" should be outside of the component that is being crashed.
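A minimal sketch of a "crash mechanism" that lives outside the component could look like this. The names Supervisor, Supervise and RunProcess are hypothetical; the component is assumed to run as its own OS process that reports a clean exit with exit code 0.

```csharp
using System;
using System.Diagnostics;

public static class Supervisor
{
    // Runs 'runOnce' until it reports a clean exit (code 0). Gives up with an
    // exception once 'maxRestarts' restarts have not helped.
    // Returns the number of restarts that were performed.
    public static int Supervise(Func<int> runOnce, int maxRestarts)
    {
        int restarts = 0;
        while (runOnce() != 0)
        {
            if (restarts == maxRestarts)
                throw new InvalidOperationException("Component keeps crashing; giving up.");
            restarts++;
        }
        return restarts;
    }

    // Runs the component as a separate process, so that its crash cannot take
    // the supervisor down with it. 'executablePath' is a placeholder.
    public static int RunProcess(string executablePath)
    {
        var startInfo = new ProcessStartInfo(executablePath) { UseShellExecute = false };
        using (Process child = Process.Start(startInfo))
        {
            if (child == null)
                throw new InvalidOperationException("Could not start the component.");
            child.WaitForExit();
            return child.ExitCode;
        }
    }
}
```

A supervisor for a real component would then be something like Supervisor.Supervise(() => Supervisor.RunProcess(path), 5), with the component itself doing its crash recovery on every start.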

The proper way to handle most exceptions should be to invalidate any object that might be in a corrupt state as a consequence, and continue execution if invalidated objects don't prevent that. For example, the safe paradigm for updating a resource would be:

acquire lock
try
  update guarded resource
if exception
  invalidate lock
else
  release lock
end try

If an unexpected exception occurs while updating the guarded resource, the resource should be presumed in a corrupt state, and the lock invalidated, regardless of whether the exception is of a type that would otherwise be benign.

Unfortunately, resource guards implemented via IDisposable/using will get released whenever the guarded block exits, without any way of knowing whether the block exited normally or abnormally. Thus, even though there should be well-defined criteria for when to continue after an exception, there's no way of telling when they apply.
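One possible way around this limitation is to pass the update to the guard as a delegate instead of wrapping it in a using block, because a try/catch inside the guard can tell a normal exit from an abnormal one. The following is only a sketch of the pseudocode above; the class InvalidatableGuard is hypothetical and not part of .NET.

```csharp
using System;

public sealed class InvalidatableGuard
{
    private readonly object _gate = new object();
    private bool _invalid;

    public bool IsInvalid
    {
        get { lock (_gate) { return _invalid; } }
    }

    // Runs 'update' while holding the lock. If 'update' throws, the guard is
    // marked invalid before the lock is released and the exception is rethrown.
    // Every later call fails immediately instead of touching the resource.
    public void Update(Action update)
    {
        lock (_gate)
        {
            if (_invalid)
                throw new InvalidOperationException("Guarded resource is presumed corrupt.");

            try
            {
                update();
            }
            catch
            {
                _invalid = true;
                throw;
            }
        }
    }
}
```

Code that does not depend on the invalidated resource can keep running, while anything that tries to use it gets an InvalidOperationException rather than silently working with corrupt state.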

You might use the approach that every single iOS and macOS app follows: an uncaught exception takes down the application immediately. In addition, many errors, such as an out-of-bounds array access or even an arithmetic overflow, do the same in newer applications. No warning.

In my experience many users don't take any notice but just tap on the app icon again.

Obviously you need to make sure that such a crash doesn’t lead to significant data loss and definitely doesn’t lead to costly mistakes. But an alert “Your app will crash now. Call support if it bothers you” isn’t helping anyone.

Licensed under: CC-BY-SA with attribution
