Question

Erlang's (or Joe Armstrong's?) advice NOT to use defensive programming and to let processes crash (rather than pollute your code with needless guards trying to keep track of the wreckage) makes so much sense to me now that I wonder why I wasted so much effort on error handling over the years!

What I wonder is: is this approach only applicable to platforms like Erlang? Erlang has a VM with simple native support for process supervision trees, and restarting processes is really fast. Should I spend my development effort (when not in the Erlang world) on recreating supervision trees, rather than bogging myself down with top-level exception handlers, error codes, null results, and so on?

Do you think this change of approach would work well in (say) the .NET or Java space?


Solution

It's applicable everywhere. Whether or not you write your software in a "let it crash" pattern, it will crash anyway, e.g., when hardware fails. "Let it crash" applies anywhere you need to withstand reality. Quoth James Hamilton:

If a hardware failure requires any immediate administrative action, the service simply won’t scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren’t frequently used, they won’t work when needed.

This doesn't precisely mean "never use guards." But don't be afraid to crash!

OTHER TIPS

Yes, it is applicable everywhere, but it is important to note in which context it is meant to be used. It does not mean that the application as a whole crashes, which, as @PeterM pointed out, can be catastrophic in many cases. The goal is to build a system which as a whole never crashes but can handle errors internally. In our case it was telecom systems, which are expected to have downtime on the order of minutes per year.

The basic design is to layer the system and isolate its central parts, which monitor and control the other parts that do the work. In OTP terminology we have supervisor and worker processes. Supervisors have the job of monitoring the workers (and other supervisors), with the goal of restarting them in the correct way when they crash, while the workers do all the actual work. Structuring the system properly in layers using this principle of strictly separating the functionality allows you to move most of the error handling out of the workers and into the supervisors. You try to end up with a small fail-safe error kernel, which, if correct, can handle errors anywhere in the rest of the system. It is in this context that the "let it crash" philosophy is meant to be used.
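Outside Erlang you can sketch the same shape with whatever concurrency primitives you have. Below is a minimal, illustrative Java sketch (the class name, method name, and restart limit are invented for the example; this is not OTP): a supervisor owns a worker thread, restarts it when it dies from an uncaught exception, and escalates by crashing itself once the restart budget is spent.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a supervisor that restarts a crashing worker thread and
// escalates when restarts are exhausted. Not OTP; names are made up for the sketch.
public final class Supervisor {

    public static void runSupervised(Runnable worker, int maxRestarts)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRestarts; attempt++) {
            AtomicBoolean crashed = new AtomicBoolean(false);
            Thread t = new Thread(worker, "worker");
            // The worker contains no defensive error handling: an uncaught
            // exception simply kills the thread, and the crash is recorded here.
            t.setUncaughtExceptionHandler((thread, error) -> crashed.set(true));
            t.start();
            t.join();                      // wait for normal exit or crash
            if (!crashed.get()) {
                return;                    // finished normally: nothing to restart
            }
            // Crashed: loop around and start a fresh worker from a known-good state.
        }
        // Too many crashes in a row: escalate to whatever supervises this supervisor.
        throw new IllegalStateException("worker keeps crashing; escalating");
    }

    public static void main(String[] args) throws InterruptedException {
        runSupervised(() -> {
            if (Math.random() < 0.5) throw new RuntimeException("simulated fault");
            System.out.println("work done");
        }, 3);
    }
}
```

The layering then falls out naturally: a supervisor that gives up does not try to repair anything; it crashes too, and whatever supervises it applies the same logic one level up.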

You get the paradox that you are thinking about errors and failures everywhere, with the goal of actually handling them in as few places as possible.

The best approach to handling an error depends, of course, on the error and the system. Sometimes it is best to catch errors locally within a process and try to handle them there, with the option of failing again if that doesn't work. If you have a number of cooperating worker processes, it is often best to crash them all and restart them together; it is a supervisor which does this.
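For the "crash them all and restart them together" case, here is a hedged Java sketch of the idea (roughly what OTP calls a one_for_all strategy; the names are invented for illustration): as soon as one worker in a cooperating group fails, the survivors are cancelled and the whole group is restarted from scratch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Illustrative only: if any worker in a cooperating group crashes, cancel the
// others and restart the whole group from a clean state.
public final class GroupSupervisor {

    public static void superviseGroup(List<Callable<Void>> workers, int maxRestarts)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRestarts; attempt++) {
            ExecutorService pool = Executors.newFixedThreadPool(workers.size());
            try {
                ExecutorCompletionService<Void> done = new ExecutorCompletionService<>(pool);
                List<Future<Void>> futures = new ArrayList<>();
                for (Callable<Void> w : workers) {
                    futures.add(done.submit(w));
                }
                boolean groupCrashed = false;
                for (int i = 0; i < workers.size(); i++) {
                    try {
                        done.take().get();     // next worker to finish, in any order
                    } catch (ExecutionException crash) {
                        groupCrashed = true;   // one worker crashed...
                        break;                 // ...so stop waiting for the rest
                    }
                }
                if (!groupCrashed) {
                    return;                    // all workers completed normally
                }
                // Crash the survivors too, then loop around and restart the group.
                for (Future<Void> f : futures) {
                    f.cancel(true);
                }
            } finally {
                pool.shutdownNow();
            }
        }
        throw new IllegalStateException("worker group keeps failing; escalating");
    }
}
```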

You do need a language which generates errors/exceptions when something goes wrong so you can trap them or have them crash the process. Just ignoring error return values is not the same thing.
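A small Java contrast of the two situations (the file names are just placeholders): an error return value is trivially ignored and the failure disappears, while an exception must either be handled or crash the process.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ErrorSignals {
    public static void main(String[] args) throws IOException {
        // Error as a return value: easy to discard, the failure just disappears.
        boolean deleted = new File("ignored.tmp").delete();  // result often ignored

        // Error as an exception: it gets handled somewhere, or the process crashes.
        Files.delete(Path.of("state.tmp"));  // throws NoSuchFileException if missing
    }
}
```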

It is called fail-fast. It's a good paradigm provided you have a team of people who can respond to the failure (and do so quickly).

In the Navy, all pipes and electrical wiring are mounted on the exterior of a wall (preferably on the more public side of the wall). That way, if there is a leak or other issue, it is more likely to be detected quickly. In the Navy, people are punished for not responding to a failure, so the approach works very well: failures are detected quickly and acted upon quickly.

In a scenario where someone cannot act on a failure quickly, it becomes a matter of opinion whether it is more beneficial to allow the failure to stop the system or to swallow the failure and attempt to continue onward.

I write programs that rely on data from real-world situations, and if they crash they can cause big $$ in physical damage (not to mention big $$ in lost revenue). I would be out of a job in a flash if I did not program defensively.

With that said, I think Erlang must be a special case, in that not only can you restart things instantly, but a restarted program can pop up, look around, and say "ahhh... that was what I was doing!"

My colleagues and I thought about the topic not so much technology-wise but more from a domain perspective, with a focus on safety.

The question is "Is it safe to let it crash?" or better "Is it even possible to apply a robustness paradigm like Erlang’s “let it crash” to safety-related software projects?".

To find an answer, we did a small research project using a close-to-reality scenario with an industrial, and especially medical, background. Take a look here (http://bit.ly/Z-Blog_let-it-crash). There is even a paper for download. Tell me what you think!

Personally I think it is applicable in many cases and even desirable, especially when there is a lot of error handling to do (safety-related systems). You cannot always use Erlang (missing real-time features, no real embedded support, customer wishes ...), but I'm pretty sure you can implement it otherwise (e.g. using threads, exceptions, message passing). I haven't tried it yet though, but I'd like to.
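As an assumption-laden illustration of the "threads, exceptions, message passing" route, here is a small Java sketch of the message-passing half (the queue and worker names are invented): the worker communicates only through a mailbox queue, so nothing else holds a reference to its internal state and a supervisor can kill and replace it cheaply.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a worker that communicates solely via a mailbox queue.
// Because no one shares its internal state, crashing and replacing it is cheap.
public final class MailboxWorker {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();

        Thread worker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    String msg = mailbox.take();   // block until a message arrives
                    if (msg.isBlank()) {
                        throw new IllegalArgumentException("bad message"); // let it crash
                    }
                    System.out.println("handled: " + msg);
                }
            } catch (InterruptedException stop) {
                // A supervisor interrupted us: exit cleanly.
            }
        }, "mailbox-worker");
        worker.start();

        mailbox.put("hello");      // senders only ever touch the mailbox, never the worker
        Thread.sleep(200);         // demo only: give the worker a moment to process
        worker.interrupt();        // here a supervisor would decide whether to restart it
        worker.join();
    }
}
```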

IMHO some developers handle/wrap checked exceptions with code which adds little value. It is often simpler to allow a method to throw the original exception unless you are going to handle it and add some value.
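A short Java illustration of the point (loadConfig and parse are hypothetical names): the first version wraps the checked exception without adding any information, while the second simply declares it and lets a caller that can actually do something useful decide.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConfigLoader {

    // Low-value wrapping: catches the checked exception only to rethrow it,
    // adding nothing the caller can act on.
    static String loadConfigWrapped(Path path) {
        try {
            return parse(Files.readString(path));
        } catch (IOException e) {
            throw new RuntimeException("failed to load config", e);
        }
    }

    // Often simpler: declare the original exception and let the caller
    // (or a supervising layer) decide how to handle the failure.
    static String loadConfig(Path path) throws IOException {
        return parse(Files.readString(path));
    }

    // Hypothetical parser used by the example.
    static String parse(String raw) {
        return raw.trim();
    }
}
```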

Licensed under: CC-BY-SA with attribution