Reliability and Redundancy

Redundancy

The most powerful tool for modeling redundancy is the reliability block diagram.
In this section we use the 1-out-of-2 configuration to show how MTBF can be calculated for redundant systems.

Redundancy is a special form of failure tolerance. It basically means operating two (or more) identical units in parallel, while only one unit (or, in general, only n-k units) is needed for successful operation. Redundancy is the most powerful, but also the most expensive, means of making system failure rate and/or downtime as small as possible.
With redundancy you can improve every reliability metric by orders of magnitude. The reason is that it is rather unlikely that both redundant units fail at the same time. In particular, it is unlikely that the second unit fails during the repair of the first unit, since repair times are usually very short compared to MTBF.

We assume the constant failure rate case, not because it is simple, but rather for the following reasons:

The constant failure rate case is actually a strong achievement that every supplier and manufacturer should strive for.
It turns out that, despite the constant failure rate of the branches, the system failure rate is no longer constant.

Suppose a hypothetical system consisting only of two identical redundant branches. There shall be no further non-redundant elements, so the system is fully redundant. Only one branch is needed to keep the system operational. The failure rate of the branches shall be the same regardless of the state of the system; in particular, it shall make no difference for reliability and failure rate whether only one or both branches are operational.
This hypothetical system is somewhat idealized, but that does not affect the argument.

At first we want to keep it simple. We want to know how long the system would be operational without repair. This means we operate the system and simply wait until both branches have failed.
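So we are interested in the failure rate vs. time for both branches. Remember from further above that R(t) is the reliability function, i.e. the probability of survival / success; therefore 1-R(t) must be the probability of failure:

$$F(t) = 1 - R(t)$$

Also remember from above that the following applies for the constant failure rate case:

$$R(t) = e^{-\lambda t}$$

With the system failing only when both independent branches have failed,

$$F_{\text{System}}(t) = F_{\text{Branch}}(t)^2 = \left(1 - e^{-\lambda t}\right)^2$$

we obtain

$$R_{\text{System}}(t) = 1 - \left(1 - e^{-\lambda t}\right)^2 = 2e^{-\lambda t} - e^{-2\lambda t}$$

And with the hazard rate

$$\lambda(t) = -\frac{1}{R(t)}\,\frac{dR(t)}{dt}$$

we finally obtain the failure rate for a redundant system with two identical branches:

$$\lambda_{\text{System}}(t) = \frac{2\lambda e^{-\lambda t} - 2\lambda e^{-2\lambda t}}{2e^{-\lambda t} - e^{-2\lambda t}} = \frac{2\lambda\left(1 - e^{-\lambda t}\right)}{2 - e^{-\lambda t}}$$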
Unfortunately this expression cannot be simplified.
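
Although the failure rate has no simpler form, the mean time to the first system failure without repair does. As a worked sketch, assuming the 1-out-of-2 reliability function R_System(t) = 1 - (1 - e^{-λt})^2 with constant branch failure rate λ = 1/MTBF_Branch, integrating the reliability over time gives:

```latex
\begin{aligned}
R_{\text{System}}(t) &= 1 - \left(1 - e^{-\lambda t}\right)^2 = 2e^{-\lambda t} - e^{-2\lambda t} \\
\text{MTTF}_{\text{System}} &= \int_0^\infty R_{\text{System}}(t)\,dt
  = \frac{2}{\lambda} - \frac{1}{2\lambda}
  = \frac{3}{2\lambda}
  = 1.5 \cdot \text{MTBF}_{\text{Branch}}
\end{aligned}
```

So without repair, the second branch only extends the mean operating time by 50 %. The large gains from redundancy come from repairing the failed branch while the other one keeps the system running.
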
Plotted vs. time, the system failure rate starts at zero at t=0. Here is an easy way to understand this: while a single branch can actually fail at t=0 (this is just the branch failure rate), the system can fail at t=0 only if both branches fail at exactly the same point in time, namely at t=0. The mathematical probability of both failures happening at exactly the same point in time is zero. For t>0, the longer we wait, the more likely it becomes that both branches have failed.
For t → ∞ the system failure rate asymptotically approaches the branch failure rate. Here is an easy way to understand this: the more time passes, the more likely it becomes that one branch has already failed. After the first failure, with only one branch functioning, the system has a constant failure rate, namely the branch failure rate.
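
The two limits described above can be checked numerically. The following is a minimal sketch, assuming the closed form λ_System(t) = 2λ(1 - e^{-λt}) / (2 - e^{-λt}), which follows from R_System(t) = 1 - (1 - e^{-λt})^2 for two identical, independent branches. The function names and the branch failure rate of 1e-3 per hour are hypothetical and chosen only for illustration.

```python
import math


def branch_reliability(t, lam):
    """Reliability of a single branch with constant failure rate lam."""
    return math.exp(-lam * t)


def system_reliability(t, lam):
    """Reliability of a 1-out-of-2 system without repair.

    The system fails only when both independent branches have failed.
    """
    return 1.0 - (1.0 - branch_reliability(t, lam)) ** 2


def system_failure_rate(t, lam):
    """Hazard rate of the 1-out-of-2 system, -R'(t) / R(t)."""
    e = math.exp(-lam * t)
    return 2.0 * lam * (1.0 - e) / (2.0 - e)


if __name__ == "__main__":
    lam = 1e-3  # hypothetical branch failure rate in 1/h (MTBF_Branch = 1000 h)
    print("t [h]   R_system    failure rate / lam")
    for t in (0, 10, 100, 1000, 5000, 20000):
        ratio = system_failure_rate(t, lam) / lam
        print(f"{t:6d}  {system_reliability(t, lam):.6f}  {ratio:.6f}")
```

The printed ratio starts at zero for t=0 and approaches 1 for large t, i.e. the system failure rate tends to the branch failure rate.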

The zero failure rate at t=0 can also be seen on the reliability graph: R(t) is nearly constant (flat) for small t.

Now, instead of waiting until the system fails, let's begin with repair as soon as one of the branches fails. Let's further assume that the system is designed in a manner that allows us to repair the failed branch while the system continues to operate on the other branch. This improves system reliability dramatically, since system failures can then occur only during a repair. The probability that the system fails during a repair is simply

$$P_1 \approx \frac{\text{MTTR}}{\text{MTBF}_{\text{Branch}}}$$

where MTTR is the mean time to repair a branch, assumed to be short compared to MTBF_Branch.

(MTBF of two identical branches = 1/2 x MTBF of a single branch)
The mean time between system failures, MTBF_System, is thus given by

$$\text{MTBF}_{\text{System}} = \frac{1}{2}\,\text{MTBF}_{\text{Branch}} \cdot \frac{1}{P_1}$$

This scenario is one of the rare cases where it is easier to use MTBF for the calculation instead of failure rate.
1/2 x MTBF_Branch is the MTBF for either of the two branches failing.
1/P1 is the multiple of (1/2 x MTBF_Branch) it takes on average between two system failures.
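
To put numbers on this, here is a minimal sketch of the calculation. It assumes that P1 can be approximated by MTTR / MTBF_Branch, where MTTR (mean time to repair) is short compared to MTBF_Branch; the exact value for a constant failure rate and a fixed repair time would be 1 - exp(-MTTR / MTBF_Branch). The function names and the example values (10,000 h branch MTBF, 8 h repair time) are hypothetical.

```python
import math


def prob_failure_during_repair(mtbf_branch, mttr, exact=False):
    """Probability P1 that the remaining branch fails during one repair.

    exact=False uses the short-repair approximation MTTR / MTBF_Branch;
    exact=True assumes a constant failure rate and a fixed repair time.
    """
    if exact:
        return 1.0 - math.exp(-mttr / mtbf_branch)
    return mttr / mtbf_branch


def mtbf_system(mtbf_branch, mttr):
    """MTBF of a 1-out-of-2 system with repair: (1/2 x MTBF_Branch) x (1/P1)."""
    p1 = prob_failure_during_repair(mtbf_branch, mttr)
    return 0.5 * mtbf_branch / p1


if __name__ == "__main__":
    mtbf_branch = 10_000.0  # hypothetical branch MTBF in hours
    mttr = 8.0              # hypothetical repair time in hours
    print("P1 (approx):", prob_failure_during_repair(mtbf_branch, mttr))
    print("P1 (exact): ", prob_failure_during_repair(mtbf_branch, mttr, exact=True))
    print("MTBF_System [h]:", mtbf_system(mtbf_branch, mttr))
```

With these example values, the system MTBF comes out at about 6.25 million hours, more than two orders of magnitude above the branch MTBF, and the approximate and exact values of P1 differ only in the fourth significant digit.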
