How can redundant systems decrease failure rate if they are all essentially the same?

https://stackoverflow.com/questions/12385393

01-07-2021

Question

In a systems analysis class the instructor talked about redundant systems. She told a story in which 3 independent systems could each land an aeroplane, and on the test run all 3 failed simultaneously (somehow the pilots still had time to land manually). I don't understand why having redundant systems would help. If system A can't communicate with the landing gear, systems B and C couldn't either, right? Is the idea behind redundant systems "let's hope one of these doesn't have a bug"? If yes, wouldn't it be too late by the time the bug is discovered (e.g. primary system failed so switching to secondary, oh wait, the aeroplane blew up)?

It seems to me redundant systems are like saying "here's the same tool made a bunch of different ways, but if you need a different tool you're out of luck".

Solution

Identical systems help prevent a certain class of failure: a non-deterministic electronic or physical failure in an individual device. In other words, if you have 3 hard drives in a RAID 5 arrangement, you are protected against one hard drive having a head crash, but if two drives fail, it's restore-from-backup time. Head crashes, electronic failures, and similar faults that strike each device independently are the sorts of errors this protects against.
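To put rough numbers on the RAID 5 example above, here is a minimal sketch that assumes each drive fails independently with the same probability over some period. The figure `p = 0.01` and the helper `prob_at_least_k_fail` are made up for illustration, not measured drive statistics.

```python
from math import comb


def prob_at_least_k_fail(n: int, k: int, p: float) -> float:
    """Probability that k or more of n units fail.

    Assumes every unit fails independently with the same probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p = 0.01  # assumed per-drive failure probability (illustrative only)

single_drive_loss = p
# A 3-drive RAID 5 array loses data only when two or more drives fail.
raid5_loss = prob_at_least_k_fail(3, 2, p)

print(f"single drive:   {single_drive_loss:.6f}")
print(f"3-drive RAID 5: {raid5_loss:.6f}")
```

Under these assumptions the array is lost far less often than a single drive. The improvement depends entirely on the failures being independent.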

What it does not protect against are deterministic failures caused by the same software bug in all three systems. Back to your RAID 5 array: if the hard drives are the same and there is a bug in the controller that causes corrupted data to be written to all three, the fact that you have three hard drives with corrupted data on them is no real comfort.
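A small sketch of why identical software defeats redundancy: three channels run the same routine and a majority voter compares their outputs. The function `buggy_wrap_longitude` and its bug are invented for illustration and do not describe any real system.

```python
from collections import Counter


def buggy_wrap_longitude(lon_deg: float) -> float:
    """Hypothetical routine meant to wrap a longitude into [-180, 180).

    Deliberate bug for illustration: values at or above +180 are not wrapped.
    """
    if lon_deg < -180.0:
        return lon_deg + 360.0
    return lon_deg  # bug: should subtract 360 when lon_deg >= 180


def majority_vote(outputs):
    """Return the value reported by more than half of the channels, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# Three "redundant" channels that all run the same code.
replicas = [buggy_wrap_longitude] * 3

ok = majority_vote([f(179.5) for f in replicas])
bad = majority_vote([f(180.5) for f in replicas])

print(ok)   # 179.5, a valid longitude
print(bad)  # 180.5, out of range, yet all three channels agree
```

The voter only catches disagreement. When every channel contains the same bug, they agree on the wrong answer and the voter passes it through.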

As a good real-world example, a squadron of F-22 fighters was flying from Hawaii to Japan when the aircraft crossed the International Date Line and experienced a bad avionics dump. According to some sources they lost inertial reference, some air data, some communications, weapon systems, everything. A software bug apparently didn't handle the date line correctly and locked up all the redundant systems. The squadron had to return to base and land without instruments. Had the weather been bad, the computer crash would have turned into aircraft crashes, though we hope the pilots would have been able to eject.

Additionally, redundant systems have more complex failure cases, and these are often harder for the people responsible for safety to troubleshoot when things go wrong. For example, how is the failure of the second system handled? This has caused terror and injury in at least one aircraft case. In that case, the failure of a second angle-of-attack unit (part of the Air Data/Inertial Reference Unit) caused the system to use inputs from the sensor that had failed first, which caused first an uncommanded climb and then an uncommanded plunge. The aircraft landed safely, but this is a good reason to wear seatbelts when seated in an aircraft!
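The "what happens on the second failure" question can be made concrete with a minimal sketch of a channel selector. This is an illustration under assumed rules, not the logic of any real aircraft; the name `select_reading` and the fallback policy are hypothetical.

```python
from statistics import median


def select_reading(readings, healthy):
    """Pick one value from redundant sensor channels.

    readings: list of floats, one per channel
    healthy:  list of bools, False once a channel has been flagged as failed
    """
    good = [r for r, ok in zip(readings, healthy) if ok]
    if len(good) >= 3:
        return median(good)      # median rejects a single outlier
    if len(good) == 2:
        return sum(good) / 2     # no outlier rejection is possible any more
    if len(good) == 1:
        return good[0]           # single channel, no cross-check at all
    # Never fall back to a channel that was already flagged as failed.
    raise RuntimeError("no healthy channel available")


readings = [2.0, 2.5, 45.0]  # made-up angle-of-attack values in degrees

print(select_reading(readings, [True, True, True]))    # 2.5
print(select_reading(readings, [True, True, False]))   # 2.25
print(select_reading(readings, [True, False, False]))  # 2.0
```

Each additional failure moves the selector into a state with weaker guarantees, and every one of those states has to be designed and tested. That is where the extra complexity comes from.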

So, as always, there is a tradeoff here between robustness and being able to prove graceful handling of all possible failure states. In general, aviation treats this as a worthwhile tradeoff, but it is not free.

Licensed under: CC-BY-SA with attribution
