Reliability-Centered Maintenance (Part 1)

Jan 1, 2010

A strategy known as “Reliability-Centered Maintenance” has drastically reduced the cost of maintaining transport and military aircraft, while simultaneously improving dispatch reliability. Isn’t it time we applied this approach to piston GA?

More than 30 years ago, in 1974, the U.S. Department of Defense commissioned United Airlines to prepare a report on the techniques used by the airline industry to develop cost-efficient maintenance programs for civil airliners. The resulting report, titled Reliability-Centered Maintenance [F. S. Nowlan & H. Heap, National Technical Information Service, 1978] described a radically different approach to aircraft maintenance, based on rigorous analysis of traditional maintenance practices and evaluation of their shortcomings.

Traditionally, a major emphasis of aircraft maintenance programs had been defining specific overhaul and retirement intervals (TBOs) in order to achieve a satisfactory level of reliability. However, engineering analysis of reams of operational data from a number of major air carriers produced fascinating insights into the conditions that must exist for scheduled maintenance to be effective. Two discoveries were especially surprising:

  • For a complex item (like an engine), scheduled overhaul has little effect on the overall reliability unless the item has a single dominant failure mode.
  • There are many items for which there is simply no form of scheduled maintenance that is technically and economically feasible.

For example, RCM researchers determined back in the 1970s that scheduled overhauls on turbine engines do not produce any reliability or economic benefit, and that maintaining such powerplants strictly on-condition provides longer life, reduced maintenance costs, and improved reliability. (Next month, we’ll look at some compelling data that suggests the same applies to piston aircraft engines.)

Reliability-Centered Maintenance (RCM) has resulted in huge cost savings for the airlines. For example, the initial maintenance program for the Douglas DC-8 (developed before RCM) required scheduled overhaul of 339 items, compared to just 7 items for the larger and more complex DC-10 (whose maintenance program was developed using RCM). By the way, none of those 7 items are engines.

As another example, the pre-RCM DC-8 required 4,000,000 man-hours of structural inspections during its initial 20,000 hours of operation, while the post-RCM Boeing 747 required only 66,000 man-hours over the same interval. That’s a reduction of nearly two orders of magnitude.
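
The arithmetic behind that comparison is easy to check. Here is a minimal sketch in Python that uses only the two man-hour figures quoted above; the variable names are mine.

```python
import math

# Structural-inspection man-hours over the first 20,000 hours of operation,
# as quoted in the text.
dc8_man_hours = 4_000_000    # pre-RCM Douglas DC-8
b747_man_hours = 66_000      # post-RCM Boeing 747

reduction_factor = dc8_man_hours / b747_man_hours
orders_of_magnitude = math.log10(reduction_factor)

print(f"Reduction factor: about {reduction_factor:.0f}x")
print(f"Orders of magnitude: about {orders_of_magnitude:.2f}")
```

The ratio works out to roughly 60 to 1, or about 1.8 orders of magnitude.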

Not only are these cost savings immense, but they were achieved with no decrease in safety or dispatch reliability. To the contrary, safety and reliability actually improved in almost every instance when emphasis shifted from scheduled retirement-overhaul-replacement to on-condition maintenance.

In the remainder of this article, I will talk about some of the fundamental principles of RCM. Then, in a follow-up article next month, I will explore how these principles can be applied to our piston-powered airplanes.

As we delve into the theory of RCM, keep in mind that this is hardly a theoretical matter. My maintenance management firm (http://www.savvymx.com/) is presently managing 150 piston-powered owner-flown aircraft using RCM, and we’re saving the owners thousands of dollars each year in reduced maintenance costs. The airlines and military have been using RCM for decades and saving a fortune on maintenance, and it’s high time that this approach trickled down to the low end of the aviation food chain!

Functions and Failures

Each system and component of an airplane performs one or more functions. The purpose of maintenance is to ensure that each item continues to perform those functions to an acceptable standard of performance. In some cases (e.g., ability to withstand g-loads), that standard is established by the FAA during aircraft certification; in other cases (e.g., dispatch reliability), it is established by the aircraft owner or operator.

Before we can establish a rational performance standard for a component, we need to examine the consequences of failure. For a component whose failure is likely to result in death or injury (e.g., a wing spar), the likelihood of failure must be infinitesimally low. On the other hand, for a component whose failure is simply an inconvenience (e.g., the #2 comm radio), a higher failure probability is acceptable.

From a maintenance standpoint, we must do whatever it takes to prevent the failure of safety-critical items like wing spars and engines, even if it’s expensive to do so. On the other hand, it’s usually not worth spending any time or effort to prevent the failure of a non-critical item like a #2 comm radio; we just run the item to failure and then fix it when it fails.

Often, the consequences of failure depend on the component’s operating context. The failure of a dry vacuum pump is much less critical if the aircraft has a standby vacuum pump or an electric backup attitude indicator. The failure of an engine is considerably less critical on a four-engine airplane than on a single-engine airplane. The failure of a wing spar is less critical if the wing has a fail-safe multiple-spar design.

RCM classifies the consequences of failure into four categories, in descending order of importance:

  • Safety consequences. A failure has safety consequences if it could kill or injure someone. (E.g., engine failure in a single-engine airplane, failure of primary structure in any airplane.)
  • Operational consequences. A failure has operational consequences if it prevents the aircraft from being operated (AOG). (E.g., failure of a magneto, primary alternator, or primary flight display.)
  • Hidden consequences. A failure has hidden consequences if it is not apparent to the flight crew, but could cause a subsequent failure to have more serious consequences. (E.g., failure of a backup voltage regulator, standby vacuum system, or backup alternator.)
  • Non-operational consequences. Failures in this category are evident to the flight crew, but impact neither safety nor operation, and involve only the cost of repair. (E.g., failure of the #2 comm radio or #2 GPS navigator.)
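
The four categories above can be read as a short series of yes/no questions. The sketch below is one way to express that in Python. The class names, field names, and the order of the checks are my own assumptions that simply follow the list; a formal RCM decision diagram is more detailed than this.

```python
from dataclasses import dataclass
from enum import IntEnum


class Consequence(IntEnum):
    """Failure-consequence categories; a lower value means more important."""
    SAFETY = 1
    OPERATIONAL = 2
    HIDDEN = 3
    NON_OPERATIONAL = 4


@dataclass
class FailureMode:
    # Hypothetical record describing one way an item can fail.
    name: str
    could_kill_or_injure: bool
    grounds_aircraft: bool
    evident_to_crew: bool


def classify(failure: FailureMode) -> Consequence:
    """Walk the categories in descending order of importance."""
    if failure.could_kill_or_injure:
        return Consequence.SAFETY
    if failure.grounds_aircraft:
        return Consequence.OPERATIONAL
    if not failure.evident_to_crew:
        return Consequence.HIDDEN
    return Consequence.NON_OPERATIONAL


# Examples taken from the list above.
examples = [
    FailureMode("engine failure, single-engine airplane", True, True, True),
    FailureMode("magneto", False, True, True),
    FailureMode("backup alternator", False, False, False),
    FailureMode("#2 comm radio", False, False, True),
]
for failure in examples:
    print(f"{failure.name}: {classify(failure).name}")
```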

Feasible? Worth doing?

RCM does not require that all failures be prevented. It recognizes that not all failures are created equal, and that maintenance resources should be focused on reducing failures that really matter. RCM concentrates on preventing failure of items with safety or serious operational consequences, and detecting hidden failures so that they can be corrected in a timely fashion. For failures with non-operational consequences, the optimum course of action is often reactive rather than proactive (i.e., fix it only when it fails).

For a failure with safety or operational consequences, RCM attempts to prevent the failure by identifying a proactive maintenance task to be undertaken before the failure occurs. Such proactive tasks may involve scheduled overhaul, scheduled replacement, or on-condition maintenance. However, in order for such a proactive task to be adopted, it must first be shown to be both technically feasible and worth doing:

  • Technically feasible. A task is considered technically feasible if it reduces the consequences of the associated failure to an extent that is acceptable to the owner or operator of the aircraft. (In other words, it gets the job done.)
  • Worth doing. A task is considered worth doing if it reduces the consequences of the associated failure to an extent that justifies the direct and indirect costs of doing the task. (In other words, it is cost-effective.)

If it is not possible to find a proactive task that is both technically feasible and worth doing, then the failure must be dealt with reactively. That might mean corrective maintenance (fix it only when it breaks), failure finding (scheduled functional checks to detect hidden failures), or redesign (e.g., install a backup).
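
The selection logic described above can be sketched in a few lines of Python. Everything here is illustrative: the names are hypothetical, and the way the function chooses among the three fallback options is my own simplifying assumption, not a rule stated by RCM.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProactiveTask:
    # Hypothetical candidate task (scheduled overhaul, scheduled
    # replacement, or on-condition inspection) with its two verdicts.
    description: str
    technically_feasible: bool
    worth_doing: bool


def select_action(candidates: Sequence[ProactiveTask],
                  hidden_failure: bool,
                  redesign_justified: bool = False) -> str:
    """Pick a proactive task if one passes both tests, else a fallback."""
    for task in candidates:
        if task.technically_feasible and task.worth_doing:
            return f"proactive: {task.description}"
    # No proactive task qualifies. The ordering of the fallbacks below
    # is an assumption made for this sketch.
    if hidden_failure:
        return "failure finding: scheduled functional check"
    if redesign_justified:
        return "redesign: e.g., install a backup"
    return "corrective maintenance: fix it only when it breaks"


comm_radio_tasks = [
    ProactiveTask("scheduled replacement", technically_feasible=True,
                  worth_doing=False),
]
print(select_action(comm_radio_tasks, hidden_failure=False))
```

In this example the only candidate task would work but does not pay for itself, so the function falls through to corrective maintenance.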

Age-Related Failures

Many aircraft owners, mechanics, and even aeronautical engineers still believe that the best way to optimize reliability of complex aircraft systems (e.g., engines) is to do some kind of proactive maintenance on a routine scheduled basis. Conventional wisdom is that this should consist of overhauls or replacement at fixed intervals. Figure 1 illustrates this fixed-interval view of failure:

This traditional view assumes that most items operate reliably for some fixed period of time (“useful life”), after which the probability of failure starts to increase rapidly (“wear-out zone”). It is predicated on the notion that analysis of failures will allow us to predict the useful life of an item and take scheduled action to overhaul or replace it before it reaches the point where risk of failure becomes unacceptable.

This traditional view is valid for components that have a single, dominant, age-related failure mode. For example, the failure pattern illustrated in Figure 1 is appropriate when considering an item that normally fails from metal fatigue due to repetitive stress, such as a wing spar or cylinder head.

In this traditional view, the probability of failure during the item’s useful life is usually small but nonzero. Therefore, a modest number of premature failures can be expected before the item reaches the end of its useful life, at which point the probability of failure starts increasing.

For safety-critical items like wing spars, whose failure has extreme safety consequences (if it fails, you could die), the traditional approach is to establish a very conservative “safe life limit” that ensures the item is retired before the probability of failure reaches some very low threshold (e.g., one in a thousand or one in ten thousand). This is illustrated in Figure 2:

However, RCM researchers determined decades ago that very few aircraft components and systems exhibit a pattern of failure that corresponds to the traditional view. For example, many complex components have a failure pattern that looks more like Figure 3:

In this pattern, known as a “bathtub curve” for obvious reasons, the component exhibits a high risk of failure when first placed in service, commonly known as “infant mortality.” Once the infant mortality period has passed, the probability of failure drops to a low level for the remainder of the item’s useful life, after which it rises as the item is continued in service into the wear-out zone.
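
For readers who like to see numbers, the bathtub shape described above can be mimicked by adding three hazard terms: one that falls with age, one that stays constant, and one that rises with age. The sketch below uses the Weibull hazard function, a standard reliability-engineering tool that the RCM report is not being quoted on here. Every parameter value is invented purely for illustration and does not come from any real engine or component data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1).

    shape < 1 gives a falling hazard, shape == 1 a constant one, and
    shape > 1 a rising one. Requires t > 0.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Sum of three illustrative hazard terms (all values assumed)."""
    infant_mortality = weibull_hazard(t, shape=0.5, scale=20_000.0)
    random_failures = 1.0 / 10_000.0
    wear_out = weibull_hazard(t, shape=5.0, scale=2_500.0)
    return infant_mortality + random_failures + wear_out


for hours in (10, 100, 500, 1000, 1500, 2000, 2500):
    print(f"{hours:>5} h  hazard = {bathtub_hazard(hours):.6f} per hour")
```

With these made-up parameters the hazard starts high, settles to a low plateau somewhere between 500 and 1,000 hours, and then climbs again as the wear-out term takes over.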

This is commonly accepted to be the failure pattern associated with piston aircraft engines, although I don’t think that’s quite correct. (Much more about this in Part 2 of this article next month.)

The Six Failure Patterns

One of the most fascinating findings by RCM researchers is that there are actually six different failure patterns exhibited by various mechanical, electrical and electronic aircraft components. These are illustrated in Figure 4:

  • Pattern B corresponds to the traditional view of age-related failures. It depicts a constant or very slowly increasing failure probability, followed by a pronounced “wear-out zone” where the probability of failure increases rapidly. However, RCM studies of civil aircraft found that only 2% of all items actually conform to this failure pattern, including items whose dominant failure mode is repetitive-stress metal fatigue. For such items, a fixed age limit (safe life or TBO) may be appropriate and desirable.
  • Pattern A, the bathtub curve, accounts for another 4% of items. This failure pattern depicts a high-risk infant-mortality period, followed by a constant or very slowly increasing failure probability, and then followed by a pronounced “wear-out zone.” Such items may also benefit from a fixed age limit, provided the number of premature failures is small enough that the majority of items survive to TBO.
  • Pattern C depicts a failure probability that gradually increases with age, but with no obvious wear-out zone or useful life. Approximately 5% of all items exhibited this pattern. It is not usually desirable to impose a fixed age limit on such items.
  • Pattern D depicts a failure probability that is low when the item is new or newly overhauled, then increases to a constant level that continues as long as the item remains in service. This pattern accounted for 7% of all items.
  • Pattern E depicts a constant failure probability. In other words, the conditional probability of failure is unrelated to age, and failures occur randomly. This pattern accounted for 14% of all items.
  • Finally, Pattern F depicts a high-risk infant-mortality period, followed by a constant or very slowly increasing failure probability, with no apparent wear-out zone or useful life. RCM studies showed that a whopping 68% of all items in civil aircraft exhibit this pattern, particularly electronic equipment.
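
The percentages in the list above are easier to compare when tallied by characteristic. This short Python sketch does that. The shares are the ones quoted in the list; the groupings (which patterns have a wear-out zone, an infant-mortality period, or a failure probability that rises with age) are my reading of the descriptions.

```python
# Share of items exhibiting each failure pattern, in percent (from the list).
pattern_share = {"A": 4, "B": 2, "C": 5, "D": 7, "E": 14, "F": 68}

# Groupings based on the pattern descriptions above.
has_wear_out_zone = {"A", "B"}
has_infant_mortality = {"A", "F"}
rises_with_age = {"A", "B", "C"}


def share_of(patterns):
    """Total percentage of items covered by the given patterns."""
    return sum(pattern_share[p] for p in patterns)


total = share_of(pattern_share)
print(f"All patterns:            {total}%")
print(f"Pronounced wear-out:     {share_of(has_wear_out_zone)}%")
print(f"Infant mortality:        {share_of(has_infant_mortality)}%")
print(f"Any increase with age:   {share_of(rises_with_age)}%")
```

The six shares add up to 100%. Only 6% of items have a pronounced wear-out zone, while 72% start life with an infant-mortality period.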

These findings contradict the traditional belief that reliability is predominantly age-related, and that the more often an item is overhauled or replaced, the less likely it is to fail. RCM studies show clearly that unless there is a dominant age-related failure mode (e.g., metal fatigue), age limits and scheduled overhauls do little or nothing to improve reliability. In fact, for the 72% of items that exhibit failure patterns A and F, scheduled overhaul or replacement can actually increase overall failure rates by introducing infant mortality risk into an otherwise reliable system or component. 
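
To see why overhauling a Pattern F item can backfire, consider a rough numerical sketch. It assumes a hazard rate made of a constant baseline plus an infant-mortality bump that fades exponentially, assumes that an overhaul resets the item's age to zero, and assumes that an in-service repair does not. All of the numbers are invented for illustration; none come from the RCM studies.

```python
import math

BASELINE = 1.0e-4       # constant hazard, failures per hour (assumed)
INFANT_EXTRA = 9.0e-4   # extra hazard at age zero (assumed)
INFANT_DECAY = 50.0     # hours over which the extra hazard fades (assumed)


def cumulative_hazard(age_hours):
    """Integral from 0 to age of BASELINE + INFANT_EXTRA * exp(-t / INFANT_DECAY)."""
    bump = INFANT_EXTRA * INFANT_DECAY * (1.0 - math.exp(-age_hours / INFANT_DECAY))
    return BASELINE * age_hours + bump


def expected_failures(horizon_hours, overhaul_interval=None):
    """Expected failures over the horizon, with optional scheduled overhauls.

    Each overhaul resets the age to zero, so the item passes through the
    infant-mortality period again.
    """
    if overhaul_interval is None:
        return cumulative_hazard(horizon_hours)
    full_cycles, remainder = divmod(horizon_hours, overhaul_interval)
    return (full_cycles * cumulative_hazard(overhaul_interval)
            + cumulative_hazard(remainder))


horizon = 4000.0
for interval in (None, 2000.0, 1000.0):
    label = "no overhaul" if interval is None else f"overhaul every {interval:.0f} h"
    print(f"{label:>24}: {expected_failures(horizon, interval):.3f} expected failures")
```

Under these assumptions, every overhaul adds another pass through the infant-mortality bump without buying any reduction in the baseline rate, so the expected number of failures goes up as the overhaul interval gets shorter.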

RCM shows that fixed age limits and scheduled overhauls are technically feasible only if:

  • there is an identifiable age (TBO) after which the item shows a rapid increase in the conditional probability of failure (i.e., an obvious wear-out zone); and
  • most of the items survive to that age; i.e., there are relatively few premature failures.

On-Condition Maintenance

Although RCM has revealed that there is often little or no relationship between time-in-service and likelihood of failure, most failures give some sort of warning that they are about to occur. If we can detect these warnings in time, we may be able to take maintenance action to prevent the failure and avoid its consequences; see Figure 5:

If a developing failure can be detected somewhere between point P (where it first becomes detectable) and point F (where total failure occurs), it may be possible to take action to prevent the consequences of the failure. Whether or not it is technically feasible to do this depends on how quickly the failure occurs, how far in advance it becomes detectable, and how difficult it is to detect the potential failure. On-condition maintenance consists of checking for potential failures so that action can be taken to prevent functional failures before they occur.

The warning period between the occurrence of a detectable potential failure and its decay into a total functional failure is known as the “P-F interval” in RCM-speak. It may be measured in hours, cycles, calendar months, or any other appropriate metric. In order to detect failures reliably before they occur, on-condition maintenance tasks must be performed at intervals that are less than the P-F interval. In practice, it is usually optimal to implement a task frequency that corresponds to about one-half of the P-F interval. (E.g., if the P-F interval is 100 hours, we need to inspect every 50 hours to ensure that we will detect a potential failure in plenty of time to avert a total failure.) Such condition monitoring is considered to be technically feasible if:

  • it is possible to identify a well-defined and reliably detectable potential failure condition;
  • the P-F interval is reasonably consistent and predictable; and
  • it is practical to inspect or monitor the item at an interval approximately one-half the P-F interval.
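
The half-interval rule of thumb above is simple enough to sketch in Python. The function names are mine, and the worst-case calculation assumes that every inspection reliably catches a potential failure once it has passed point P.

```python
def inspection_interval(pf_interval, fraction=0.5):
    """Inspection interval as a fraction of the P-F interval (default one-half)."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return pf_interval * fraction


def worst_case_warning(pf_interval, interval):
    """Least time remaining before functional failure at first detection.

    The worst case is a potential failure that becomes detectable just
    after an inspection, so it is not seen until the next inspection.
    Returns 0.0 if the interval is too long to guarantee detection.
    """
    if interval >= pf_interval:
        return 0.0
    return pf_interval - interval


pf = 100.0  # hours, the example used in the text
interval = inspection_interval(pf)
print(f"Inspect every {interval:.0f} hours")
print(f"Worst-case warning: {worst_case_warning(pf, interval):.0f} hours")
```

With a 100-hour P-F interval and inspections every 50 hours, even the unluckiest timing leaves 50 hours in which to act before total failure.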

Let’s Get Real

So much for the philosophy and theory of Reliability-Centered Maintenance. Next month, we’ll get right down to the nitty-gritty of how we can apply these principles to the maintenance of our piston-powered general aviation airplanes, with special emphasis on piston aircraft engines.

Meantime, if you’d like to learn more about the technical aspects of RCM, I recommend you obtain a copy of John Moubray’s book [Reliability-Centered Maintenance, John Moubray. Second edition. 1997. ISBN 0-8311-3078-4.] from Amazon or Barnes & Noble. For a more aviation-oriented discussion, the original 1978 Nowlan & Heap report is available from the National Technical Information Service (http://www.ntis.gov/), a division of the Department of Commerce—the NTIS document number is ADA066579.
