A Risk Based Approach to Maintenance Optimisation of Business Critical Railway Structures/Equipment
Ujjwal Bharadwaj, Loughborough Univ/TWI Ltd
John Wintle, TWI Ltd
Vadim Silberschmidt, Loughborough University
Paper presented at the Railway Network Younger Members Best Paper Competition - Seminar on Application of New Technologies to Railways, London, 28 Nov 2007.
Railway infrastructure managers are under increasing pressure to minimise life cycle costs whilst meeting reliability or availability targets, and to operate within safety and environmental regulations. This paper presents a risk based decision-making methodology for making run-repair-replace decisions with the ultimate aim of maximising the Net Present Value (NPV) of the investment in such maintenance. The methodology is based on established engineering, financial and statistical techniques that are already used in power plant management. In this paper, the railway system under consideration is assumed to consist of a number of structural components.
To demonstrate this approach, a qualitative risk analysis is first conducted to highlight those components that are 'high risk'. This gives operators an overview risk profile of their system and thereby lets them focus resources on the more risky system components (structures). A quantitative risk analysis is then performed on each of these risky structures. For simplicity, corrosion has been considered as the main damage mechanism affecting railway structures. A basic probabilistic model is developed to obtain remaining life (RL) estimates of the structure under consideration. The RL estimates are then fed into another model, the Cost Risk Optimisation (CRO) model, which weighs the risk of the structure being out of service (measured in monetary units as the product of the probability of being out of service and the cost of its consequences) against the cost of risk mitigation by undertaking some action (repair or replacement). Using this risk based approach, the CRO model gives the optimum time of action such that financial benefit is maximised. NPV is used to assess the value of an action so that the time value of money is taken into account. In addition, the model can take into account tax credits accruing due to the depreciation of the structure (if applicable).
If there are a number of structures or sub-units and a fixed budget, as is often the case, the model finds the schedule of action using the same risk based approach.
Maintenance manages the process of ageing of a structure or machinery. Ageing in the current context is a process that manifests itself in an increase in the failure rate within a system comprising a number of sub-systems.
Fig.1. The bath tub curve
Figure 1 depicts the 'bath tub' curve, an idealized curve showing the failure rate within a system consisting of a large number of components as a function of the system's operational life. During the initial stage - the Infant Mortality stage - 'teething problems' cause the failure rate to be high. As these problems, which are mainly due to design and/or manufacturing, are identified and solved, the failure rate decreases. The structure then enters the next phase of its life - the Useful or Normal life - during which the failure rate is kept almost constant by prudent maintenance. The structure enters its third phase - the Ageing phase - when the failure rate starts to rise. The failure rate during this phase is high because of the time-dependent damage accumulated during the previous stages and ongoing damage.
This paper develops and demonstrates a risk-based methodology to estimate the optimum time of replacement or repair of a structure, given a number of such structures and limited budgetary support.
The paper starts with a discussion on risk and its analyses - qualitative and quantitative - the concept of NPV, and moves on to the trade off involved in risk-based decision-making within budgetary constraints.
2. The risk based approach to maintenance
Risk has numerous definitions. According to The American Petroleum Institute (API), risk is a combination of the probability of an event and its consequence. It is a deviation from the normal or expected. Numerically, it is a product of probability of an event occurring and the consequence of the event.
A risk-based approach considers failure taking cognizance of the two elements that constitute risk - the probability (or likelihood) of failure and the consequence of that failure. Figure 2 shows a two dimensional risk profile of ten components of a bridge.
Fig.2. Risk Plot of several structures within a system
The probabilities and the consequences of failure of ten components have been determined and presented as points on a Risk Plot. The probability of failure and its consequence may be estimated qualitatively, in which case the X-axis (abscissa) and Y-axis (ordinate) in Figure 2 would represent bands of Very Low (VL), Low (L), High (H) and Very High (VH) consequence and likelihood of failure, respectively. In the case of quantitative analysis, the axes would carry numerical data, with the probability of failure on the Y-axis and, for example, the loss of revenue as the consequence of failure on the X-axis. An iso-risk line is also plotted, representing a constant risk level defined by the operator according to their perception of an acceptable threshold level of risk. The iso-risk line separates components of acceptable risk from those of unacceptable risk, enabling railway operators to focus their maintenance resources on the relatively more risky structures.
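The screening step described above can be illustrated with a minimal sketch. It assumes a quantitative risk plot in which risk is the product of probability and consequence, and the iso-risk line is a single threshold in monetary units. The component names, probabilities, costs and the threshold are hypothetical values chosen for illustration, not data from this paper.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    prob_failure: float  # annual probability of failure (0..1), hypothetical
    consequence: float   # cost of failure in monetary units, hypothetical

    @property
    def risk(self) -> float:
        # Risk as the product of probability and consequence.
        return self.prob_failure * self.consequence


def split_by_iso_risk(components, threshold):
    """Split components into (unacceptable, acceptable) about an iso-risk level.

    Unacceptable components are sorted by descending risk so that the
    most risky structure comes first.
    """
    high = sorted(
        (c for c in components if c.risk > threshold),
        key=lambda c: c.risk,
        reverse=True,
    )
    low = [c for c in components if c.risk <= threshold]
    return high, low


components = [
    Component("tower", 0.02, 500_000.0),
    Component("deck joint", 0.10, 20_000.0),
    Component("bearing", 0.001, 1_000_000.0),
]
high, low = split_by_iso_risk(components, threshold=5_000.0)
```

With these illustrative numbers only the tower lies above the iso-risk level, so it would be the one carried forward to the quantitative analysis.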
Risk analysis is the systematic use of information to identify sources of risk, and to estimate the risk of failure. It forms the basis for risk evaluation, risk mitigation and risk acceptance (or risk avoidance). Information used in risk analysis usually includes historical data, theoretical analysis, informed opinions and stakeholder concerns.
3. Risk analysis methods
Risk analysis methods are generally categorised as qualitative or quantitative. There may be an intermediate category (semi-quantitative) depending upon how quantitative the risk analysis is. The API's Recommended Practice 580 on risk-based inspection (API 2002) describes a 'continuum of approaches' ranging from the qualitative to the quantitative, Figure 3. The figure depicts the level of detail in risk analysis, from a purely qualitative approach at one end of the spectrum to a purely quantitative one at the other, with intermediate approaches in between.
Fig.3. Continuum of Risk Analysis methods
3.1 Qualitative analysis
Qualitative analysis uses engineering judgement and experience as the basis for risk assessment. The results of the analysis largely depend on the expertise of the user. The primary advantage of qualitative risk analysis is that it enables assessment in the absence of detailed numerical data. It is also a pragmatic first step towards a quantitative risk analysis, because it screens out components of less concern. Moreover, the results can serve as a reality check on the outcome of quantitative analysis. However, it is not a very detailed method and provides only a broad categorization of risk. Failure Modes, Effects and Criticality Analysis (FMECA), Hazard and Operability Studies (HAZOPS), and the Risk Matrix approach are examples of qualitative methods. In the Risk Matrix approach, the likelihoods and consequences of failure are qualitatively described in broad ranges (e.g. high, medium or low).
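As a rough illustration of the Risk Matrix approach, the sketch below combines a likelihood band and a consequence band into a broad risk category. The four bands follow those named for Figure 2; the scoring rule and the cut-offs between 'low', 'medium' and 'high' are assumptions made for this example, since in practice the operator defines them.

```python
# Bands ordered from least to most severe.
BANDS = ["VL", "L", "H", "VH"]


def risk_category(likelihood: str, consequence: str) -> str:
    """Return a broad risk category for a (likelihood, consequence) pair.

    The score is the sum of the two band positions (0..6). The cut-offs
    below are hypothetical and would be set by the operator.
    """
    score = BANDS.index(likelihood) + BANDS.index(consequence)
    if score >= 5:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


# Screening a few hypothetical components by their qualitative bands.
screened = {
    "tower": risk_category("H", "VH"),
    "deck joint": risk_category("H", "L"),
    "handrail": risk_category("L", "VL"),
}
```

Only components that fall in the 'high' category would then be taken forward to quantitative analysis.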
3.2 Quantitative analysis
Qualitative risk assessments become less discerning as system complexity increases, so a quantitative method is often required to discriminate between the risks of the components in a system. Quantitative analysis assigns numerical values to the probability (e.g. 10⁻⁵ failure events per year) and the consequences of failure (e.g. revenue loss or inventory released over 1,000 m²). Qualitative analysis techniques such as FMECA and HAZOPS can become quantitative when the values of failure consequence and failure probability are numerically estimated. These values can be taken from a variety of sources such as generic failure databases and elicited expert opinions, or calculated by means of specific engineering and statistical analysis (ASME 2003). There are statistical methods for combining data from various sources or updating data with additional information (Jordaan 2005; Kallen and Noortwijk 2005; Khan et al 2006).
In the current discussion, it is assumed that there is a system comprising a number of structures and a qualitative analysis, as shown in the previous Section, has identified the high-risk components.
For the Quantitative Risk Analysis method proposed in this paper, a failure frequency-time curve for the Ageing period of life is developed by engineering analysis of the structure for the active or potentially active in-service damage mechanisms, e.g. corrosion. The consequence of failure is in financial terms as it is assumed that the functioning of the structure is business-critical.
4. Cost Risk Optimisation (CRO)
The next step is the calculation of the optimum action schedule or date, of the run-repair-replace action. This calculation weighs the financial benefits of maintenance action against the risk (as expressed in costs) of not taking the action. The ultimate aim is to maximise the net present value of the investment (i.e. the maintenance action) by adjusting the date of the action.
If there are a number of structures and limited budgetary resources, then the Multiple Structures CRO prioritises actions based on the same cost risk optimisation as discussed above.
5. Decision-making using financial criterion
5.1 The need for financial criterion in maintenance decision making
Maintenance projects are increasingly being evaluated by decision makers who need to understand the implications of various options in financial terms. Although predictive maintenance techniques have matured, the predictions are in engineering terms, and these are not easily understood by financially oriented decision makers. Thus, there is a need to express engineering wear and tear in financial terms. In the current context, this is done by evaluating the cost of consequences.
5.2 The drivers for a consistent decision making methodology in maintenance
Many old plants, structures, capital equipment or components are in their Ageing period of life. However, increasing competition means many of them cannot be replaced and need to have their useful life extended. In addition, new components are often designed to operate with maximum efficiency, and are designed with lower 'margins of error' against assumed operating conditions.
Each action (or project) has costs associated with it. These costs are, in essence, investments made by the asset owner with the expectation of a certain return on the investment(s). The decision maker will normally be faced with a number of projects competing for such investments, and therefore needs to take decisions that maximise the returns on these investments. The most widely understood financial techniques to evaluate projects include 'return on investment', 'pay-back period' and 'discounted cash flow (DCF)' methods (Brealey and Myers 1991). These techniques have various strengths and weaknesses. This paper employs the net present value (NPV) technique, which is a form of DCF analysis.
6. Maximizing NPV using probabilistic damage mechanism models
6.1 NPV Financial analysis
In the current discussion, it is assumed that a project with a higher NPV is a better investment than a project with a lower NPV. The NPV of a project is the present (current) value of the total future cash flows, both positive (income) and negative (cost). NPV considers the time value of money by discounting all the cash flows, and it is calculated as follows:
$$\mathrm{NPV} = \sum_{t=0}^{N} \frac{C_t}{(1+r)^t} \qquad (1)$$
where N is the project life (years); t is the timing of the cash flow (year); r is the interest rate, or discount rate; and C_t is the cash flow in year t.
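As a minimal sketch of the discounting just described, the function below assumes the standard discounted cash flow form in which the cash flow in year t is divided by (1 + r) raised to the power t, with t starting at 0. The cash flows and the 10% rate are hypothetical values for illustration.

```python
def npv(cash_flows, rate):
    """Net present value of cash_flows, where cash_flows[t] occurs in year t.

    Costs are negative and income is positive; rate is the discount rate
    per year (e.g. 0.10 for 10%).
    """
    return sum(c / (1.0 + rate) ** t for t, c in enumerate(cash_flows))


# An outlay of 1000 now followed by two yearly incomes of 600.
example_npv = npv([-1000.0, 600.0, 600.0], rate=0.10)
```

Under these assumptions the example evaluates to about 41.32, so the illustrative project would be marginally worthwhile at a 10% discount rate.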
The future cash flows are expected cash flows, as they do not occur with certainty. The uncertainty arises in the engineering analysis to calculate the probability of failure over time for the damage mechanism(s) of interest.
The risk associated with any project is finally expressed in terms of its NPV by using expected values (EV). The EV of a failure event is the product of the probability of the event occurring and the cost of consequence of that event.
The cost of consequence of the failure event is directly assessed from a prior quantitative consequence analysis, and it must be expressed in financial terms.
Thus, the NPV of a project with uncertain outcomes is the sum of the expected values of all future discounted cash flows, as follows:
$$\mathrm{NPV} = \sum_{t=0}^{N} \frac{p_t \, C_t}{(1+r)^t} \qquad (2)$$
where p_t is the probability of the event occurring at time t.
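The expected-value form can be sketched in the same way, assuming each year's cash flow is weighted by the probability of the event in that year before discounting. The failure cost, probabilities and discount rate below are hypothetical.

```python
def expected_npv(cash_flows, probabilities, rate):
    """Sum of discounted expected cash flows.

    cash_flows[t] is the cash flow if the event occurs in year t and
    probabilities[t] is the probability of the event in year t.
    """
    if len(cash_flows) != len(probabilities):
        raise ValueError("cash_flows and probabilities must have equal length")
    return sum(
        p * c / (1.0 + rate) ** t
        for t, (p, c) in enumerate(zip(probabilities, cash_flows))
    )


# A failure costing 100,000 whose yearly probability rises over three years.
failure_cost = -100_000.0
example = expected_npv([failure_cost] * 3, [0.01, 0.02, 0.05], rate=0.05)
```

With these numbers the expected present value of failure is roughly -7,440, which is the kind of monetary risk figure that is weighed against the cost of a maintenance action.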
6.2 Probabilistic damage mechanism model
This paper does not discuss the details of damage models for use in probabilistic analysis. Instead it illustrates a simple probabilistic damage mechanism model for general corrosion of a railway structure. Consider a structure subjected to corrosion, say, the tower structure of a bridge. Assuming this to be the only damage mechanism causing failure of the structure, the remaining life of the structure can be calculated as
$$RL = \frac{T_c - MAT}{CR} \qquad (3)$$
where RL is the remaining life (years); T_c is the current thickness of the structure (mm); MAT is the minimum allowable thickness (mm); and CR is the corrosion rate (mm/year).
CR is derived from periodic in-service measurements of metal loss resulting from corrosion. Tc is known from the most recent thickness measurements on the structure (or at the start of the structure's life, Tc can be assumed to be equal to the original nominal thickness of the structure as specified by the designer including tolerances, corrosion allowance, etc). MAT is the absolute minimum thickness calculated by the designer to prevent failure by overload, collapse, etc. as appropriate.
The convention is to calculate RL in a deterministic manner, whereby each independent variable in Equation 3 is a specific value. This assumes that these variables have no random or probabilistic aspects but can be defined in a fixed predictable fashion. In reality, there is considerable uncertainty associated with these variables, and each can be defined by a statistical distribution of values.
In the method here, a statistical analysis tool (i.e. Palisade's @RISK for Microsoft Excel) is used to describe all the independent variables probabilistically, and RL is then calculated using the Monte Carlo Simulation (MCS) technique. The RL calculated by MCS is therefore a distribution of values, from which the annual probability of failure (the failure rate per year) over time can be obtained. This probability versus time curve may then be used to derive the EV, where the EV of a failure event is the product of the probability of the event occurring and the cost of consequence of that event, at a specific point in time.
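The Monte Carlo step can be sketched with Python's standard library in place of @RISK. The sketch assumes that remaining life is (current thickness - minimum allowable thickness) / corrosion rate, and the choice of distributions and their parameters (normal for the thicknesses, lognormal for the corrosion rate) is purely illustrative rather than taken from this paper. The yearly figure computed here is the fraction of simulated lives that end in each year.

```python
import random


def remaining_life(tc, mat, cr):
    """Remaining life in years from thicknesses (mm) and corrosion rate (mm/year)."""
    return (tc - mat) / cr


def simulate_remaining_life(n_samples, seed=0):
    """Draw remaining-life samples from assumed input distributions."""
    rng = random.Random(seed)  # seeded for repeatable results
    samples = []
    for _ in range(n_samples):
        tc = rng.gauss(12.0, 0.3)           # current thickness, mm (assumed)
        mat = rng.gauss(8.0, 0.2)           # minimum allowable thickness, mm (assumed)
        cr = rng.lognormvariate(-1.6, 0.3)  # corrosion rate, median about 0.2 mm/year (assumed)
        # A negative margin means the structure has already failed: clamp to 0.
        samples.append(max(remaining_life(tc, mat, cr), 0.0))
    return samples


def annual_failure_probability(samples, horizon_years):
    """Fraction of samples whose remaining life ends in each year 1..horizon."""
    n = len(samples)
    return [
        sum(1 for rl in samples if year - 1 <= rl < year) / n
        for year in range(1, horizon_years + 1)
    ]


samples = simulate_remaining_life(20_000)
probs = annual_failure_probability(samples, horizon_years=40)
```

The list `probs` plays the role of the probability versus time curve: multiplying each entry by the cost of consequence gives the expected value of failure for that year.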
6.3 Cost Risk Optimisation
The key inputs to the optimisation model are as follows:
(a) The expected present value of the proposed action (replacement or repair of the asset); (b) the expected present value of inaction which is equal to the expected present value of the production losses avoided as the result of undertaking the proposed action; (c) any financial constraints, such as the annual maintenance budget limit; and (d) any non-financial constraints, e.g. on failure rates due to safety regulations.
Thus, for the NPV of an action taken at time t=n, the following can be defined:
CBt - Cash flows associated with production in year t;
CPt - Cash flows associated with implementing the project in year t, including any tax credits (positive cash flow) on depreciation costs (Collier and Glagola 1998);
N - the maintenance planner's strategic planning period;
n - year in which the action is proposed to be undertaken;
pt - probability of the event (failure) not occurring in year t; and
r - interest rate, i.e. the cost of money (finance).
In the current context, NPV of action in any year 'n' is given by:
NPV = (Expected present value of action) + (Expected present value of inaction)   (4)
Assuming that cash outflows are negative and cash inflows are positive, and failure results in production loss
The optimization algorithm calculates the year of maintenance action, for which NPV is maximum (least negative), subject to stipulated constraints.
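A simplified, single-structure version of this search can be sketched as follows. It is an illustration under stated assumptions, not the paper's full formulation: the action cost is a single outflow in year n, the expected present value of inaction is taken as the discounted expected failure losses avoided from year n to the end of the planning period, the failure probability after the action is treated as negligible, and tax credits and other constraints are ignored. All numbers are hypothetical.

```python
def action_npv(n, action_cost, failure_cost, p_fail, rate):
    """NPV of acting in year n under the simplifying assumptions above.

    p_fail[t] is the annual failure probability in year t if no action is
    taken; action_cost and failure_cost are positive magnitudes.
    """
    # Discounted cost of the action itself (an outflow, hence negative).
    cost_term = -action_cost / (1.0 + rate) ** n
    # Discounted expected failure losses avoided from year n onward (positive).
    avoided = sum(
        p_fail[t] * failure_cost / (1.0 + rate) ** t
        for t in range(n, len(p_fail))
    )
    return cost_term + avoided


def optimum_action_year(action_cost, failure_cost, p_fail, rate):
    """Return the year in the planning period with the highest action NPV."""
    return max(
        range(len(p_fail)),
        key=lambda n: action_npv(n, action_cost, failure_cost, p_fail, rate),
    )


# Hypothetical rising failure probability over an 8-year planning period.
p_fail = [0.001, 0.002, 0.005, 0.01, 0.03, 0.08, 0.15, 0.25]
best_year = optimum_action_year(
    action_cost=100_000.0, failure_cost=1_000_000.0, p_fail=p_fail, rate=0.08
)
```

In this example deferring the action is worthwhile only while the interest saved on the action cost exceeds that year's expected failure loss, which gives year 3 as the optimum.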
The maintenance action may be replacement or repair. In case of replacement, the equipment/component begins its life cycle from its Infant Mortality Stage through to the Ageing Stage. In the current model, it is assumed that repair improves the condition of the equipment such that it returns to a stage prior to the Ageing Stage i.e. the Normal Life Stage or, preferably, the Infant Mortality Stage. It may be noted here that the implications in terms of tax credits on equipment depreciation may be different in case of replacement compared to repair and these need to be reflected in the above equation accordingly.
Fig.4. Failure probability over time
7. Demonstration of the model
The approach described above is demonstrated by evaluating three structures subjected to corrosion. The failure frequency from a probabilistic corrosion model is presented in Figure 4. The optimized replacement time for Structure #1 is shown in Figure 5. The year in which the 'Action NPV' is maximum (least negative) has been calculated. NPV is obtained by considering the expected net present value of: (a) the cash flows resulting from the replacement of the structure; and (b) the lost production outage cost avoided due to replacement, i.e. the cost of inaction. The optimum action date for Structure #1 is 2013.
Fig.5. Optimized action time for Structure#1
Figure 5 shows the application of the risk-based approach to maintenance of Structure #1. The probability of failure-time curve derived from remaining life estimates is, on its own, insufficient for the decision maker. The Action NPV-time curve generated by the risk-based approach enables the user to make a more informed maintenance decision by also considering the consequences of failure in conjunction with the probability of failure. The optimal action date is the time when the NPV of the action is maximum (2013 for Structure #1).
The optimised action years for the other two structures are derived in the same way. To determine the optimised action years for all three structures within a budgetary constraint, the Solver in MS Excel was used. The resulting optimised schedule is shown in Figure 6.
Fig.6. Action schedule for three structures Str#1, Str#2 and Str#3 subject to budgetary constraints
| Description | Action Year | Capital Cost | NPV |
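The scheduling of several structures under a budget can be illustrated with a small sketch that replaces the Excel Solver with exhaustive enumeration, which is only practical for a handful of structures. It uses the same simplified action NPV as a stand-in for the paper's formulation (single outflow in the action year, avoided expected failure losses thereafter, no tax credits), and an annual budget cap as the only constraint. The two structures and all numbers are hypothetical.

```python
from itertools import product


def action_npv(n, action_cost, failure_cost, p_fail, rate):
    """Simplified NPV of acting in year n (assumed form, see lead-in)."""
    cost_term = -action_cost / (1.0 + rate) ** n
    avoided = sum(
        p_fail[t] * failure_cost / (1.0 + rate) ** t
        for t in range(n, len(p_fail))
    )
    return cost_term + avoided


def best_schedule(structures, annual_budget, rate):
    """Try every combination of action years and keep the feasible one
    with the highest total NPV. Returns (years, total_npv)."""
    horizon = min(len(s["p_fail"]) for s in structures)
    best_years, best_total = None, float("-inf")
    for years in product(range(horizon), repeat=len(structures)):
        # Capital spend per year for this candidate schedule.
        spend = [0.0] * horizon
        for s, n in zip(structures, years):
            spend[n] += s["action_cost"]
        if max(spend) > annual_budget:
            continue  # violates the annual budget limit
        total = sum(
            action_npv(n, s["action_cost"], s["failure_cost"], s["p_fail"], rate)
            for s, n in zip(structures, years)
        )
        if total > best_total:
            best_years, best_total = years, total
    return best_years, best_total


structures = [
    {"action_cost": 100_000.0, "failure_cost": 1_000_000.0,
     "p_fail": [0.001, 0.002, 0.005, 0.01, 0.03, 0.08, 0.15, 0.25]},
    {"action_cost": 100_000.0, "failure_cost": 1_000_000.0,
     "p_fail": [0.001, 0.002, 0.007, 0.02, 0.05, 0.10, 0.18, 0.28]},
]
years, total = best_schedule(structures, annual_budget=150_000.0, rate=0.08)
```

Taken individually, both hypothetical structures are best treated in year 3, but the budget allows only one action per year. The enumeration brings the second structure forward to year 2, because that is the cheapest deviation from the unconstrained optimum.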
8. Limitations of the model
For more complex systems with an increasing number of components and constraints, non-linear optimisation tools (e.g. those based on genetic algorithms) that are more powerful than the linear solver in Microsoft Excel may be required;
There are economic dependencies in maintenance, so with increasing dependencies, the methodology will become more complex and more computing power may be required for the analysis;
For safety critical systems, the constraints on the failure probability may be so severe as to cancel out any potential financial benefits from applying the methodology.
This methodology is essentially for proposed action during the Ageing Phase of a plant in which failure is primarily due to accumulated damage. Thus the method needs to be used in conjunction with whole life cycle maintenance methods rather than in isolation.
Risk-based maintenance optimization often requires a detailed analysis using quantitative techniques. The proposed methodology uses engineering analysis by developing a basic probabilistic damage mechanism model to obtain failure rates. The resulting failure rates over time are used to calculate expected present values of cash flows before and after selected maintenance actions (e.g. equipment replacement).
It has been shown that the optimum year of replacement can be calculated when the net present value (NPV) of the maintenance action is maximised. If there is a budgetary constraint that does not allow for a series of actions in a system of structures to be undertaken in a given strategic planning period, cost risk optimization of multiple structures can be easily undertaken using the approach.
Other damage mechanisms such as fatigue (using remaining life estimates from fatigue sensors) can be incorporated in the method described here.
The authors wish to acknowledge the financial support of TWI Ltd and the CICE (Centre of Innovative and Collaborative Engineering) at Loughborough University for this research.
- API 2002, Risk-based Inspection, API Recommended Practice 580, American Petroleum Institute, USA.
- ASME International 2003, Risk Based Methods for Equipment Life Management, CRTD Vol. 41, ASME, New York.
- Jordaan, I. 2005, Decisions Under Uncertainty: Probabilistic Analysis for Engineering Decisions.
- Kallen, M.J. & Noortwijk, J.M.V. 2005, 'Optimal maintenance decisions under imperfect inspection', Reliability Engineering & System Safety.
- Khan, F.I., Haddara, M.M. & Bhattacharya, S.K. 2006, 'Risk Based Integrity and Inspection Modelling (RBIIM) of Process Components/System', Risk Analysis, Vol. 26, no. 1.
- Brealey, R.A. & Myers, S.C. 1991, Principles of Corporate Finance, 4th Edition, McGraw-Hill, New York.
- Collier, A.C. & Glagola, R.C. 1998, Engineering Economics and Cost Analysis, 3rd Edition, Addison-Wesley, California, pp. 438-464.