Formulating Reliability Targets

The Chain Is Only as Strong as Its Weakest Link

A telephony system is a complex collection of components. Each component has its own failure rate, so when every component is required for the system to work, the reliability of the whole system can never be higher than that of its least reliable component. Furthermore, if component failures are independent of one another, the product of the component reliability metrics gives a rough estimate of the reliability of the whole system.

For example, consider the hypothetical system at ACME Anvil Company shown in the table below with associated reliability metrics:

| Component | Estimated Reliability |
| --- | --- |
| Phone Lines from Phone Company | 99.97% |
| Office PBX System | 99.98% |
| Power Supply to Office and Phone System | 99.6% |
| Voice Mail Server with Uninterruptible Power Supply (UPS) | 99.3% |

If any one component fails, the whole system becomes non-functional. Therefore, ACME Anvil's system cannot achieve anything higher than 99.3% reliability, because the voice mail server only works 99.3% of the time. For the system as a whole, the estimated reliability is the product of the reliabilities of the four components: 99.97% x 99.98% x 99.6% x 99.3% = 98.85%! Surprised? This multiplicative drop in reliability occurs because each of the components has its own failure rate.
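The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the dictionary and the function name `series_reliability` are illustrative, and the calculation assumes that every component is required and that failures are independent.

```python
from math import prod

# Component reliabilities from the ACME Anvil table, written as fractions.
acme_components = {
    "Phone Lines from Phone Company": 0.9997,
    "Office PBX System": 0.9998,
    "Power Supply to Office and Phone System": 0.996,
    "Voice Mail Server with UPS": 0.993,
}

def series_reliability(reliabilities):
    """Product of the component reliabilities (all components required)."""
    return prod(reliabilities)

system = series_reliability(acme_components.values())
weakest = min(acme_components.values())

print(f"Weakest link (upper bound):   {weakest:.2%}")
print(f"Estimated system reliability: {system:.2%}")
```

The estimate always lands at or below the weakest link, because every factor in the product is at most 100%.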

Developing Realistic Targets for the Whole System

Some applications include Four or Five Nines reliability as part of their requirements. For example, it may be safe to assume that a nuclear reactor requires 99.99% or better reliability, regardless of cost. Since Active Call Center is not licensed for control of nuclear reactors and similar applications, these applications are not discussed further here.

Most applications have Two to Three Nines reliability requirements. The usual objective in these cases is to achieve the reliability target by maximizing reliability within the available spending budget.
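To get a feel for what these targets mean in practice, a reliability percentage can be converted into expected downtime per year. The sketch below assumes that reliability means the fraction of time the system is working and that a year has 8,760 hours; the function name `downtime_hours_per_year` is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def downtime_hours_per_year(reliability):
    """Expected hours per year the system is unavailable at a given reliability."""
    return (1.0 - reliability) * HOURS_PER_YEAR

levels = [
    ("Two Nines", 0.99),
    ("Three Nines", 0.999),
    ("Four Nines", 0.9999),
    ("Five Nines", 0.99999),
]

for label, reliability in levels:
    hours = downtime_hours_per_year(reliability)
    print(f"{label:<12} {hours:8.2f} hours of downtime per year")
```

Under these assumptions, Two Nines allows roughly 87.6 hours of downtime a year, while Three Nines allows under 9 hours.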

In formulating reliability targets, it helps to compile a table of each of the system's components and its associated estimated reliability metric. This information makes it possible to analyze reliability for the whole system quickly, as demonstrated below.

A table of reliability rates for each component quickly reveals reliability levels above which it will be prohibitively expensive to implement a system. Recall that in the earlier ACME Anvil example, the phone lines from the phone company were only 99.97% reliable. This means that to get anything better than 99.97% reliability, special service will be required from the phone company. Presumably these services would be very expensive, and so we might inform ACME Anvil's management that anything more than 99.97% is out of their budget.

The table of reliability metrics for ACME Anvil's system also helps identify where budget dollars can best be spent to gain the largest increase in overall system reliability. Remember that the two weakest components of ACME's system are the power supply to the phone system and the voice mail server. It's very easy to boost the reliability of the power supply by adding a UPS battery backup with several hours of battery capacity. Let's assume we could get such a battery unit for $500 and that it would boost the power supply's reliability to 99.98%. The new reliability metric for the whole system would then be 99.97% x 99.98% x 99.98% x 99.3% = 99.2%. That's a significant increase in reliability (up from roughly 98.85%) for only $500!
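The same what-if comparison can be sketched in Python. The figures come from the example above (including the assumed $500 price and the assumed 99.98% post-upgrade reliability); the variable names are illustrative only.

```python
from math import prod

# Reliabilities before the upgrade, taken from the ACME Anvil table.
before = {
    "phone_lines": 0.9997,
    "pbx": 0.9998,
    "power_supply": 0.996,
    "voice_mail_server": 0.993,
}

# Assumed effect of the UPS battery backup on the power supply only.
after = dict(before, power_supply=0.9998)
ups_cost_dollars = 500

system_before = prod(before.values())
system_after = prod(after.values())
gain_points = (system_after - system_before) * 100  # percentage points

print(f"Before UPS: {system_before:.2%}")
print(f"After UPS:  {system_after:.2%}")
print(f"Gain: {gain_points:.2f} percentage points for ${ups_cost_dollars}")
```

Swapping in a different component or a different post-upgrade figure makes it easy to compare candidate upgrades side by side.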

Let's suppose that ACME Anvil's CEO has informed us that they would like to have 99.5% reliability. The simple analysis performed to this point has revealed two important findings: (a) the system can easily be brought up to 99.2% reliability by adding a battery backup UPS, and (b) with three of the four components at reliability rates of 99.97% or higher after the upgrade, the only other place to improve reliability for the system is the voice mail server. Hmmm... sounds like a job for Active Call Center!
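Working backward from the target shows how reliable the voice mail server would need to be. The sketch below divides the 99.5% target by the product of the other three components, assuming the UPS upgrade described above is in place; `required_reliability` is a hypothetical helper name.

```python
from math import prod

target = 0.995

# The three components other than the voice mail server, after the UPS upgrade.
other_components = {
    "phone_lines": 0.9997,
    "pbx": 0.9998,
    "power_supply_with_ups": 0.9998,
}

def required_reliability(target, others):
    """Minimum reliability the remaining component needs for the product to reach the target."""
    return target / prod(others)

needed = required_reliability(target, other_components.values())
print(f"Voice mail server must reach at least {needed:.2%}")
```

A result above 100% would mean the target cannot be met by improving that one component alone.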

Apply the Analysis to an Appropriate Level of Detail

The basic strategy presented above is as follows:

  1. Identify all system components.
  2. Compile a table with estimated reliability metrics for each component.
  3. Estimate the reliability metric for the system as a whole by taking the product of the individual component metrics.
  4. Establish achievable lower and upper bounds for reliability based on cost factors and reliability of existing components.
  5. Evaluate estimated system reliability against the required target.
  6. Review the items in the table of components to decide where best to apply budget dollars for maximum reliability increases.
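Steps 3 through 6 lend themselves to a small script. The following is a rough sketch rather than a complete method: it does not model cost, so it only supplies the weakest-link upper bound for step 4 and a weakest-first ordering as a starting point for step 6. The function names `analyze` and `weakest_first` are invented for illustration.

```python
from math import prod

def analyze(components, target):
    """Steps 3 to 5: estimate, bound, and compare against the target."""
    estimate = prod(components.values())           # step 3: product of the metrics
    weakest = min(components, key=components.get)  # least reliable component
    return {
        "estimate": estimate,
        "upper_bound": components[weakest],        # step 4: weakest-link ceiling
        "weakest": weakest,
        "meets_target": estimate >= target,        # step 5: compare with the target
    }

def weakest_first(components):
    """Step 6 starting point: component names from least to most reliable."""
    return sorted(components, key=components.get)

# Steps 1 and 2: the component table from the ACME Anvil example.
acme = {
    "phone_lines": 0.9997,
    "pbx": 0.9998,
    "power_supply": 0.996,
    "voice_mail_server": 0.993,
}

report = analyze(acme, target=0.995)
print(report)
print(weakest_first(acme))
```

Ordering by reliability alone is a simplification; a real budget decision would also weigh what each improvement costs.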

This same general analysis can be applied to each component at very fine levels of detail. The ACME Anvil example above did not include much detail, but a more detailed analysis could easily be done. For example, we might analyze reliability of the voice mail server and find:

| Server Component | Estimated Reliability |
| --- | --- |
| Processor | 99.999% |
| RAM | 99.999% |
| Hard Drive | 99.8% |
| Software | 99.5% |
| Overall (product of component metrics) | 99.3% |

Based on this analysis, management might decide to install a RAID hard drive array to increase the hard drive reliability to 99.99% and thus increase overall server reliability to about 99.49%. Even then, the server can never be more reliable than its software (99.5%), and the other system components reduce the overall figure further, so the software's reliability will have to improve to reach management's 99.5% target.
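The nested analysis can be sketched by first computing the server's reliability from its parts and then feeding that result into the system-level product. The sketch assumes the RAID array brings the hard drive to 99.99% and that the $500 UPS upgrade from the earlier example is also installed; all variable names are illustrative.

```python
from math import prod

# Sub-components of the voice mail server, from the table above.
server_parts = {
    "processor": 0.99999,
    "ram": 0.99999,
    "hard_drive": 0.998,
    "software": 0.995,
}

server_now = prod(server_parts.values())

# Assumed effect of a RAID array on the hard drive only.
server_with_raid = prod(dict(server_parts, hard_drive=0.9999).values())

# Other system components: phone lines, PBX, power supply after the UPS upgrade.
other_components = [0.9997, 0.9998, 0.9998]
system_with_raid = prod(other_components) * server_with_raid

print(f"Server today:          {server_now:.2%}")
print(f"Server with RAID:      {server_with_raid:.2%}")
print(f"Whole system with RAID: {system_with_raid:.2%}")
print(f"Software ceiling:      {server_parts['software']:.2%}")
```

Because the software figure caps the server's reliability, the whole-system estimate stays below the 99.5% target until the software itself improves.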

In this example, the extra detail has helped narrow down the focus of the reliability improvement task to the voice mail server's software.

See Also

Enterprise Class Telephony

Riding the Nines

Meeting Reliability Targets with Active Call Center

Scaling Active Call Center Deployments