Failure Theory

Reliability of Systems, Equipment and Components

A reliable piece of mechanical equipment is understood to be basically sound, to be able to meet its design specifications, and to give trouble-free performance in a given environment. However, it is necessary to have an understanding of the technical, engineering use of the term ‘reliability’ as specified for mechanical equipment. All plant, equipment and components have a finite life, and so eventually all pieces of equipment will fail. Without a technical definition of reliability it would not be possible for engineers or managers to make meaningful comparisons between the reliability of alternative plant and equipment.

Definitions

Reliability

Reliability is the probability that a plant or component will not fail to perform within specified limits in a given time while working in a stated environment. For mechanical systems the following definition can be used:

Mechanical Reliability is the probability that a spare, item, or unit will perform its prescribed duty without failure for a given time when operated correctly in a specified environment.

The definition of reliability includes a number of variables that are external to the artefact being analysed. Identical pieces of equipment may have very different duty requirements, such as frequent stop-starts or continuous running. Environmental conditions such as fine dust can also affect a machine. It is thus necessary to understand completely the operating conditions under which an artefact is expected to operate.

Maintainability

Once a piece of equipment has failed it must be possible to get it back into an operating condition as soon as possible; the ease with which this can be done is known as maintainability. To calculate the maintainability or Mean Time To Repair (MTTR) of an item, the time required to perform each anticipated repair task must be weighted (multiplied) by the relative frequency with which that task must be performed (e.g. number of times per year). MTTR data supplied by manufacturers will be purely repair time, which assumes the fault is correctly identified and the required spares and personnel are available. The MTTR to the user will also include the logistic delay, as shown below.
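As a minimal sketch of the weighting described above, the Python below computes a frequency-weighted MTTR and then adds a logistic delay to give the MTTR seen by the user. The task names, repair times, annual frequencies and the delay are hypothetical values chosen for illustration, not data from a real machine.

```python
# Hypothetical repair tasks: (task, repair time in hours, occurrences per year)
REPAIR_TASKS = [
    ("replace bearing", 6.0, 0.5),
    ("replace seal", 2.0, 2.0),
    ("replace coupling", 4.0, 0.25),
]

# Assumed average wait for labour and spares, in hours.
LOGISTIC_DELAY_HOURS = 3.0


def weighted_mttr(tasks):
    """Mean repair time, weighting each task by how often it is performed."""
    total_frequency = sum(freq for _, _, freq in tasks)
    total_repair_time = sum(hours * freq for _, hours, freq in tasks)
    return total_repair_time / total_frequency


# Repair-only figure, of the kind a manufacturer might quote.
repair_mttr = weighted_mttr(REPAIR_TASKS)

# Figure experienced by the user, including the logistic delay.
user_mttr = repair_mttr + LOGISTIC_DELAY_HOURS

print(f"Repair-only MTTR: {repair_mttr:.2f} h")
print(f"MTTR seen by the user: {user_mttr:.2f} h")
```

Frequent short repairs dominate the weighted figure, which is why the result sits closer to the seal replacement time than to the bearing replacement time.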

 

Availability

Availability is the probability that an item, under the combined influence of its reliability, maintainability and maintenance support, will be able to fulfil its required function over a stated period of time, or at a given point in time. The operating context of a piece of equipment will determine its performance requirements. An airliner will be expected to reach its destination once it has taken off. To guarantee this it will spend a relatively large amount of its time being serviced. In this case the reliability must be effectively 100%, but its availability may be relatively low. In process industries which run continuously, availability is of prime importance. The definition of availability is:

The availability of a system is the probability that the system is functioning at time t. The Availability for a single machine is:

A = MTBF/(MTBF+MTTR)

The implication of this formula is that high availability can be obtained either by increasing the MTBF, and hence the reliability, or improving the maintainability by decreasing the Mean Time To Repair (MTTR). The MTTR would include the repair time and the logistic delay (obtaining labour and spares).
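The two routes to higher availability can be compared numerically. The sketch below applies the formula above to assumed figures (an MTBF of 1000 hours, 4 hours of repair time and 6 hours of logistic delay); the numbers are illustrative only.

```python
def availability(mtbf, mttr):
    """Availability A = MTBF / (MTBF + MTTR); both in the same time units."""
    return mtbf / (mtbf + mttr)


# Assumed figures, in hours.
MTBF_HOURS = 1000.0
REPAIR_HOURS = 4.0
LOGISTIC_DELAY_HOURS = 6.0

# MTTR as seen by the user: repair time plus logistic delay.
user_mttr = REPAIR_HOURS + LOGISTIC_DELAY_HOURS

base = availability(MTBF_HOURS, user_mttr)
# Route 1: improve reliability (here, double the MTBF).
better_reliability = availability(2 * MTBF_HOURS, user_mttr)
# Route 2: improve maintainability (here, remove the logistic delay).
better_logistics = availability(MTBF_HOURS, REPAIR_HOURS)

print(f"Base case:            {base:.4f}")
print(f"MTBF doubled:         {better_reliability:.4f}")
print(f"No logistic delay:    {better_logistics:.4f}")
```

With these assumed numbers, holding spares and labour ready improves availability by about as much as doubling the MTBF, which is often the cheaper option.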

Repairable & Non Repairable Systems

In reliability engineering repairable and non-repairable systems are treated differently. A non-repairable item is replaced with a new item which will be as good as the item replaced. Over time, as items are replaced with identical new items, the failure rate will remain constant. In a repairable system the repaired item following a breakdown will not be as good as when it was first installed (it is not “as good as new” again). General wear and errors in the repair carried out will result in the failure rate increasing over time.

Reliability Function

Some batches of components can display a constant failure rate. Alternatively, some items can contain many spares of varying failure rates. With the mixing of failure rates of new and old spares within the unit, the failure rate of the unit can be constant (known as a ‘pseudo-constant’ failure rate). In these situations:

R(t) = e^(-λt)

R(t) = Reliability at time t

λ = 1/MTBF

and so

R(t) = e^(-t/MTBF)

Where: t = time since the last failure

       MTBF = mean time between failures

Example

A motor is required to run for two years without a failure.

Manufacturer's MTBF: 2 years

R(2) = e^(-2/2) = 0.37

There is approximately a two in three chance that the motor will fail in the first two years. The manufacturer's guarantee of an MTBF of 2 years is not adequate to give a high probability of survival for two years.

Manufacturer's MTBF: 10 years

R(2) = e^(-2/10) = 0.82

There is an 82% chance that the motor will run for two years without failing.
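The worked example above can be checked with a few lines of Python. This is a minimal sketch that assumes the constant failure rate model given earlier; the function name is illustrative.

```python
import math


def reliability(t, mtbf):
    """R(t) = exp(-t / MTBF) for a constant failure rate (t, MTBF in same units)."""
    return math.exp(-t / mtbf)


MISSION_YEARS = 2.0

for mtbf_years in (2.0, 10.0):
    r = reliability(MISSION_YEARS, mtbf_years)
    print(f"MTBF {mtbf_years:g} years: R = {r:.2f}, chance of failure = {1 - r:.2f}")
```

The first case evaluates e^(-1), about 0.37, and the second e^(-0.2), about 0.82. An MTBF equal to the required running time therefore leaves only about a one in three chance of completing it without failure.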

Reliability Block Diagrams (RBD)

Up to now the reliability of individual spares has been discussed. A number of spares will make up an item, which in turn makes up a unit, a plant area, and then an entire plant. In industry it is necessary to estimate the reliabilities of pieces of equipment that interact with each other. Before any form of reliability analysis is attempted, it is necessary to represent the system under consideration as a block diagram. A block diagram with an individual block for each unit can represent the entire plant. If necessary, lower level block diagrams can represent items within units. Two basic types of diagram can be used to represent a system.

A functional block diagram (system layout) can represent the actual plant layout showing how plant units are interconnected. This helps to describe how the system is expected to operate.

For reliability assessment a Reliability Block Diagram (RBD) is more useful.

System Configurations

It is necessary to have an understanding of the basic types of unit layout that exist in industry.

Series System

In a series system, failure of any unit constitutes system failure. The reliability of the system is the product of the reliabilities of the units making up the system.

Placing units in series increases the failure rate and reduces the overall availability of the system:

λsystem = λA + λB + λC

Avsystem = AvA x AvB x AvC
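A short sketch of the series rules, assuming the units fail independently. The reliabilities, failure rates and availabilities below are invented for illustration.

```python
import math


def series_product(values):
    """Product rule for a series system (used for reliability and availability)."""
    return math.prod(values)


def series_failure_rate(failure_rates):
    """Constant failure rates of units in series add together."""
    return sum(failure_rates)


# Hypothetical three-unit system: A, B and C in series.
unit_reliabilities = [0.95, 0.90, 0.99]
unit_failure_rates = [0.05, 0.10, 0.01]   # failures per year
unit_availabilities = [0.99, 0.98, 0.995]

system_reliability = series_product(unit_reliabilities)
system_failure_rate = series_failure_rate(unit_failure_rates)
system_availability = series_product(unit_availabilities)

print(f"System reliability:  {system_reliability:.3f}")
print(f"System failure rate: {system_failure_rate:.2f} per year")
print(f"System availability: {system_availability:.3f}")
```

The system figure is always worse than that of the weakest unit, which is why long series chains need very reliable individual units.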

Full Active Redundancy

In an active redundancy system, a number of units sustain the function until one fails; the remaining units can continue to provide the function.

Note: If both of the pumps must be operational to sustain the function, in reliability terms they would be in series despite being in parallel on the system diagram.

The System Reliability Rs for one out of two units in parallel is:

Rs = 1-{(1-RA)(1-RB)}
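The parallel formula generalises to any number of units, assuming they fail independently and any one of them can sustain the function. A minimal sketch with assumed unit reliabilities:

```python
import math


def parallel_reliability(reliabilities):
    """1 minus the probability that every unit has failed (independent units)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Two hypothetical pumps, either of which can sustain the function alone.
two_pumps = parallel_reliability([0.90, 0.90])

# If both pumps were needed, they would be in series in reliability terms.
both_needed = 0.90 * 0.90

print(f"One out of two pumps needed: {two_pumps:.2f}")
print(f"Both pumps needed:           {both_needed:.2f}")
```

The comparison shows the point made in the note above: the same two pumps give a system reliability of 0.99 when one is sufficient, but only 0.81 when both are required.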

M-out-of-N

In these cases individual units share provision of the function, which can be sustained at a satisfactory level should one or more of the units fail. In active and standby redundancy systems these are known as m-out-of-n models. As shown below, at least m units out of a total of n must be in operation for the system to operate.

Example:

Three units are in active parallel, of which at least 2 must be in operation for satisfactory system operation.

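System Reliability

Rs = R1R2R3 + R1R2F3 + R1R3F2 + R2R3F1

where F = 1 - R is the probability that a unit fails. The first term is the case in which all three units operate; the remaining terms are the three cases in which exactly two operate.

If R1 = R2 = R3 = R

Then Rs = R^3 + 3R^2F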
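For identical, independent units the m-out-of-n reliability is a binomial sum over every state in which at least m units are working. For the 2-out-of-3 example this reduces to R^3 + 3R^2(1 - R). The sketch below assumes identical units; the value R = 0.9 is illustrative.

```python
from math import comb


def m_out_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent units operate."""
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(m, n + 1))


R = 0.9  # assumed reliability of each unit

two_of_three = m_out_of_n_reliability(2, 3, R)
print(f"2-out-of-3 system reliability: {two_of_three:.3f}")

# The limiting cases recover the earlier formulas:
one_of_two = m_out_of_n_reliability(1, 2, R)     # full active redundancy
three_of_three = m_out_of_n_reliability(3, 3, R)  # series system
```

With R = 0.9 the 2-out-of-3 system has a reliability of 0.972, which lies between the series case (0.729) and full redundancy across all three units.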

The “Bath Curve”

Many differing batches of mechanical and electrical industrial components have been tested to determine if it is possible to predict when they will fail. These tests have revealed that, during their normal working life, they do not reach a point of wear-out at some likely time that could be called “old age”. On the contrary, a given item is as likely to fail in a given week shortly after installation as in a given week many months later. This probability of failure, which is known as the failure rate (symbolised by the Greek letter lambda, λ), can go through three distinct failure patterns. Batches of components can display one, two or all three of these patterns (stages) through their lifetime.

In the first of the three stages, the failure rate plunges downward rapidly from a very high starting point; this is “infant mortality”. Failure during this stage can be attributed almost entirely to manufacturing and installation defects. Failures caused by manufacturing defects or poor installation tend to show up almost immediately, accounting for the high starting point. The term “Burn In”, which can also be used to describe this period, comes from the computer industry, where new machines are run in a hot environment before dispatch. Any hardware faults will show up quickly at this elevated temperature. Once a machine passes this test it should have a long trouble-free life. Equipment can also return to the infant mortality stage after maintenance intervention. For various reasons, equipment can suffer problems as a result of maintenance, so planned maintenance can actually reduce its availability.

Example: a group of similar bearings is changed every year as part of a planned maintenance activity. If they were condition monitored and changed on showing signs of imminent failure, it would be found that these bearings have an average life of 2.5 years. This over-maintenance results in an increased probability of failure due to infant mortality after each maintenance activity.

As the curve levels off, it enters the second stage, a straight segment indicating an essentially constant failure rate. In the final stage, the failure rate climbs sharply as spares wear out.
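The three stages can be illustrated numerically. The text does not name a mathematical model for the curve, so the sketch below borrows the Weibull failure rate, a common choice in reliability work, purely as an assumption. A shape value below 1 gives a falling rate, a value of 1 a constant rate, and a value above 1 a rising rate. The shape and scale values are illustrative.

```python
def weibull_failure_rate(t, shape, scale):
    """Weibull failure rate at age t > 0: (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE_YEARS = 5.0             # assumed characteristic life
AGES_YEARS = (0.5, 2.0, 4.0)  # ages at which the rate is sampled

# Assumed shape values, one per stage of the curve.
STAGES = {
    "infant mortality": 0.5,
    "constant failure rate": 1.0,
    "wear-out": 3.0,
}

stage_rates = {
    stage: [weibull_failure_rate(t, shape, SCALE_YEARS) for t in AGES_YEARS]
    for stage, shape in STAGES.items()
}

for stage, rates in stage_rates.items():
    print(f"{stage}: {[round(rate, 3) for rate in rates]}")
```

Each printed list gives the failure rate at increasing ages: falling for infant mortality, level for the constant stage, and climbing for wear-out.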

In reality the bath curve has little application in industrial processes. Very simple components can follow this failure pattern, such as a light bulb, where faulty manufacture may result in a very short life. A rare example of a more complex system that follows the bath curve is a petrol engine. Engines have to be taken care of in their early life to allow them to “bed in”. Following this they go through a period of constant failure rate. Most petrol engines then fail following a life of 100,000 to 150,000 miles. Even this so-called wear-out period can cover a significant part of the engine's life.

Six Patterns of Failure

During the development of the Boeing 747, batches of aircraft components were tested to determine their failure patterns. The results are displayed above.

Pattern A is the well-known bath curve.

Pattern B shows constant or slowly increasing failure probability ending in a wear-out zone.

Pattern C shows slowly increasing probability of failure, but there is no identifiable wear-out age.

Pattern D shows low failure probability when the item is new or just out of shop, then a rapid increase to a constant level.

Pattern E shows a constant probability of failure at all ages (random failure).

Pattern F starts with high infant mortality, which drops eventually to a constant or very slowly increasing failure probability.

In highly complex equipment, such as an aircraft, infant mortality followed by random failure is the dominating failure pattern as shown by the above studies carried out on civil aircraft. Within manufacturing industry, approximately 30% of industrial failures are related to age as in A & B. With increasing complexity of modern equipment the failure pattern from industry will more closely match the studies from the aircraft industry.

The Nature of Failure

Failure pattern B depicts age-related failures. As the figures from the aircraft industry show, very few failures have a relationship with age. An example would be abrasion, e.g. the abrasive action of piston rings on the cylinder walls of a reciprocating engine.

Failure pattern C shows a steadily increasing probability of failure, but there is no one point at which we can say, “that’s where it wears out”. Cyclic stresses resulting in fatigue are the main cause of this failure pattern.

Failure pattern E is pure random failure. All the empirical evidence shows that rolling element bearings usually conform to a random failure pattern. However, it is still possible to compute a mean time between failures (MTBF) for such items. It is given as the point at which 63% of the items have failed. Often a poor MTBF for bearings can be attributed to poor choice and/or fitting.
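The 63% figure follows from the constant failure rate model used earlier. Assuming R(t) = e^(-t/MTBF), the fraction of items that have failed by age t is 1 - R(t), and evaluating this at t = MTBF gives:

```latex
F(t) = 1 - R(t) = 1 - e^{-t/\mathrm{MTBF}}
\quad\Rightarrow\quad
F(\mathrm{MTBF}) = 1 - e^{-1} \approx 0.632
```

So for randomly failing items only about 37% survive as long as the MTBF, which is why the MTBF should not be read as an expected trouble-free life.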

Failure pattern F is the most common failure pattern and like the bath curve shows a failure rate decreasing with age before going into a period of random failure. The high infant mortality has a variety of causes:

  • poor design
  • bad workmanship
  • incorrect installation
  • poor reassembly
  • incorrect commissioning
  • incorrect operation
  • invasive maintenance
  • poor quality manufacture
  • unnecessary routine maintenance
  • cleanliness

Bell Curve

Some engineers and managers tend to be over optimistic about the effectiveness of Planned Maintenance (PM). There are limitations to PM:

  • the limitations set by random failure events. Random machinery failure events, by definition, can occur with equal probability at any time. Identical components are as likely to fail 1 week after installation as 5 years after it. In effect, they are always as good as new. There is no period of time after which it would be effective to change these components.
  • the life dispersion of machinery components. Even time dependent failures are not all that predictable. They do not appear after absolutely equal operating intervals, but after very dissimilar time periods (Figure 12). This dispersion increases as the MTBF increases.

The example below (failure distribution A) shows the distribution of failure ages for a batch of compressor bearings:

  • 10 bearings failed between 2.5 years and 3.5 years.
  • 20 bearings failed between 3.5 years and 4.5 years, etc.
  • From 3.25 years the incidence of bearing failure started to increase.
  • By 5.5 years 50% of the bearings had failed.
  • Some bearings did not fail until 8.5 years.

Imagine a strategy of changing all the bearings at 4.7 years.

The frequency of failure had started to increase over a year earlier and some bearings would continue to run satisfactorily for another 3.7 years. Even though this planned maintenance is 0.8 years before the average life of 5.5 years, a considerable number of failures still occur (area a + b).

This planned maintenance is expensive in that it changes the majority of bearings needlessly early (up to 3.7 years early), and it does not prevent all failures, as a small number still occur.

Planned maintenance for failure distribution “B” in which the majority of failures occur in the period 4.3 years to 7 years would be more effective. A PM strategy at 4.7 years would allow a small number of failures (area b). It again takes place 0.8 years before the average and 2 years before the maximum expected life. While in this case it is more suitable for PM than distribution “A”, it still incurs costs due to early maintenance. Distribution A type curves would be more usual in industry than distribution B. For failure distribution “B” and possibly “A” Condition Based Maintenance would be suitable.

Bell Curves of Failure Distribution

MTBF (years)    Frequency of occurrence
                Distribution A    Distribution B
0.5-1.5         10                10
1.5-2.5         10                10
2.5-3.5         10                10
3.5-4.5         20                10
4.5-5.25        80                30
5.25-5.75       100               180
5.75-6.5        80                26
6.5-7.5         20                4
7.5-8.5         10                0
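The difference in dispersion between the two distributions can be quantified from the tabulated frequencies. The sketch below treats every failure as occurring at the midpoint of its age band, which is an approximation, and reports the mean failure age, the standard deviation, and the share of failures falling in the central 5.25-5.75 year band.

```python
import math

# Age bands in years, taken from the table.
AGE_BANDS = [
    (0.5, 1.5), (1.5, 2.5), (2.5, 3.5), (3.5, 4.5), (4.5, 5.25),
    (5.25, 5.75), (5.75, 6.5), (6.5, 7.5), (7.5, 8.5),
]
COUNTS_A = [10, 10, 10, 20, 80, 100, 80, 20, 10]
COUNTS_B = [10, 10, 10, 10, 30, 180, 26, 4, 0]
PEAK_BAND = 5  # index of the 5.25-5.75 year band


def mean_and_spread(counts):
    """Mean and standard deviation of failure age, using band midpoints."""
    midpoints = [(low + high) / 2 for low, high in AGE_BANDS]
    total = sum(counts)
    mean = sum(m * c for m, c in zip(midpoints, counts)) / total
    variance = sum(c * (m - mean) ** 2 for m, c in zip(midpoints, counts)) / total
    return mean, math.sqrt(variance)


def peak_share(counts):
    """Fraction of all failures that fall in the central band."""
    return counts[PEAK_BAND] / sum(counts)


mean_a, spread_a = mean_and_spread(COUNTS_A)
mean_b, spread_b = mean_and_spread(COUNTS_B)

print(f"A: mean {mean_a:.2f} y, spread {spread_a:.2f} y, peak share {peak_share(COUNTS_A):.0%}")
print(f"B: mean {mean_b:.2f} y, spread {spread_b:.2f} y, peak share {peak_share(COUNTS_B):.0%}")
```

Distribution B concentrates roughly two thirds of its failures in the central band, against less than one third for distribution A, which is why a fixed replacement age suits B better than A.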

Potential Failures and the P-F Curve

With a growing awareness of the random nature of many failures, condition monitoring techniques are becoming more popular. Modern condition based maintenance practices such as vibration analysis rely on the fact that many failures do not occur instantaneously, but actually develop over a period of time. If evidence can be found that this failure process is under way, it may be possible to take action to prevent failure and/or avoid the consequences. At the early stages of failure it is not possible to detect the signs of an impending failure. As the failure develops, there comes a point at which it can be measured. At this stage the failure has not yet progressed so far that the equipment is unable to operate effectively, and the maintenance department now has the opportunity to correct the fault before it becomes more serious. The point in the failing process at which it is possible to detect that the failure is occurring or is about to occur is known as a potential failure (potential because a serious failure has not yet occurred). Examples include:

  • hot spots on bearing housings and in electrical panels.
  • vibrations indicating imminent bearing failure
  • particles in gearbox oil showing imminent gear failure
  • visible leaks and wear.

The P-F curve (shown below) shows how a failure starts and deteriorates to the point at which it can be detected (the potential failure point “P”). If it is not detected and corrected, it continues to deteriorate, usually at an accelerating rate, until it reaches the point of functional failure (“F”). The P-F interval is also known as the “Lead Time To Failure”.

The “condition” being measured can take a variety of forms. Any condition that shows a change as the health of the spare deteriorates can be used. It must, however, give enough warning between “P” and “F” to allow action to be taken; otherwise nothing will be gained by having a warning. Equipment conditions being measured could include:

  • reducing pressure supplied by a pump indicating impeller wear, slip ring wear, etc.
  • increased surface temperature on the outside of an insulating surface indicating insulation deteriorating
  • increased vibration from rotating equipment; these vibrations must be analysed further to identify possible causes.

Having identified point P, two actions can take place:

  • to prevent the functional failure. Depending on the nature of the failure mechanism, it is sometimes possible to intervene to repair the existing component before it fails completely.
  • to avoid the consequences of the failure. In most cases, detecting a potential failure does not actually prevent the spare from failing, but still makes it possible to avoid or reduce the consequences of the failure. For example, the necessary spares, personnel and equipment could be made available, or the affected part could be changed outside production time before it actually fails.

P-F Curves and inspection interval timing

The life of a component and the P-F interval are often confused. On-condition task frequencies are often based on the real or imagined “life” of the item. If it exists at all, this life is usually many times greater than the P-F interval, so the task achieves little or nothing. The component life is measured forward from the moment it enters service. The P-F interval, on the other hand, is measured back from the functional failure to the point at which a warning of the forthcoming functional failure can be detected (the potential failure). The two concepts are often unrelated.

In the above example, the batch of components has a random failure pattern with two failures in the first year. Let us assume that over a large batch these components have an average life of 3 years. From condition monitoring it has been observed that these components have a P-F interval of about 4 months. Because they started to fail in the first year of service, the condition monitoring tasks must commence immediately after installation and be repeated every 2 months. The timing of the inspections has nothing to do with the age or life of the component.
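The text does not state the rule behind the 2-month inspection interval. The sketch below assumes the common reasoning that, in the worst case, point P occurs just after an inspection and is only found at the next one, so the time left to act is the P-F interval minus the inspection interval. The function name and the trial intervals are illustrative.

```python
def minimum_warning(pf_interval, inspection_interval):
    """Worst-case time left to act once a potential failure is detected.

    Assumes point P falls just after an inspection, so one full inspection
    interval of the P-F interval is lost before detection.
    """
    return pf_interval - inspection_interval


PF_INTERVAL_MONTHS = 4.0  # P-F interval from the example above

for inspection_months in (1.0, 2.0, 4.0, 6.0):
    warning = minimum_warning(PF_INTERVAL_MONTHS, inspection_months)
    if warning > 0:
        verdict = f"at least {warning:g} months to act"
    else:
        verdict = "no guaranteed warning before functional failure"
    print(f"Inspect every {inspection_months:g} months: {verdict}")
```

Inspecting every 2 months leaves at least 2 months to plan the repair, whereas an interval equal to or longer than the P-F interval can miss the failure altogether. The average life of 3 years does not appear anywhere in the calculation.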

P-F intervals can vary considerably in length, from minutes to months. A long P-F interval, such as the 4 months in the example above, is desirable because:

  • fewer on-condition inspections are required
  • there is more time to organise the people and materials needed to correct the potential failure
  • it is easier to plan to correct the potential failure without disrupting operations or other maintenance activities
  • it is possible to do whatever is necessary to avoid the consequences of the failure in a more considered and hence more controlled fashion.