Question

I'm looking for some good articles on fault-tolerant software architectures. Could I please have some recommendations?


Solution

The Handbook of Software Reliability Engineering is available to read as a PDF. Fault tolerance is one of the main principles of software reliability.

Take a look at chapter 14, Fault-Tolerant Software.

OTHER TIPS

I found 'Release It!' to be an excellent read.

In Release It!, Michael T. Nygard shows you how to design and architect your application for the harsh realities it will face. You’ll learn how to design your application for maximum uptime, performance, and return on investment.

Link dump! :)

These are some of the online resources I drew ideas from (or used just to check terminology) when researching a certain aspect of redundancy.

ACM requires membership.

It would be very difficult to sum this up in one article, since there are multiple ways to achieve fault tolerance in software. The principles apply to desktop applications, server applications, and/or SOA. There are also multiple methodologies, a few of which we already follow without knowing it; exception handling, for example. It would be a herculean feat to try to drill down into all the concepts in one article. You can find a lot of articles with a simple search on Google.
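To make the exception-handling point a bit more concrete, here is a minimal Python sketch of one common pattern: retry an operation a few times and, if it keeps failing, fall back to a degraded alternative. The names `call_with_fallback`, `primary`, and `fallback` are made up for illustration and don't come from any of the books or articles mentioned in this thread.

```python
def call_with_fallback(primary, fallback, retries=2):
    """Call primary(); on failure retry up to `retries` more times, then use fallback().

    Hypothetical helper for illustration only.
    """
    for _ in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Treat any exception as a (possibly transient) fault and try again.
            continue
    # Every attempt failed: degrade gracefully instead of crashing.
    return fallback()


def read_live_price():
    # Stand-in for a call to an unreliable service (an assumption for the example).
    raise ConnectionError("service unavailable")


def read_cached_price():
    # Stand-in for a stale but safe value.
    return 100


price = call_with_fallback(read_live_price, read_cached_price)
```

Catching every `Exception` keeps the sketch short; in real code you would usually catch only the errors you expect to be transient.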

For my final-year project (FYP), I researched OS-wide self-healing systems. I followed the Sun Solaris 10 architecture and IBM's Autonomic Computing research (http://www.research.ibm.com/autonomic/).

This article about Software Fault Handling techniques covers the following topics:

  • Timeouts
  • Audits
  • Exception Handling
  • Task Rollback
  • Incremental Reboot
  • Voting
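
Two of the techniques listed above, timeouts and voting, are easy to sketch together. The following is a rough Python illustration, not code from the article: each replica of a computation gets a time budget, replicas that crash or run over simply lose their vote, and a result is accepted only if more than half of all replicas agree on it. The names `run_with_timeout`, `voted_call`, and `slow_replica` are hypothetical, and the sketch assumes replica results are hashable.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_with_timeout(func, timeout_s):
    """Run func() in a worker thread and raise if it takes longer than timeout_s."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a hung worker. Note the thread itself is not killed.
        executor.shutdown(wait=False)


def voted_call(replicas, timeout_s):
    """Return the answer that a strict majority of all replicas agree on."""
    answers = []
    for replica in replicas:
        try:
            answers.append(run_with_timeout(replica, timeout_s))
        except Exception:
            # A crashed or timed-out replica loses its vote.
            continue
    if not answers:
        raise RuntimeError("all replicas failed")
    value, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(replicas):
        raise RuntimeError("no majority among replicas")
    return value


def slow_replica():
    time.sleep(0.5)  # simulates a hung component
    return 4


# Two healthy replicas outvote the one that times out.
result = voted_call([lambda: 4, slow_replica, lambda: 4], timeout_s=0.1)
```

Replicas run one after another here to keep the example short; a real system would normally run them in parallel, and on separate hardware if the goal is to survive machine failures.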
Licensed under: CC-BY-SA with attribution