Klaus Brun and Rainer Kurz explain root-cause analysis and how it assists operators in diagnosing turbomachinery failures.
Turbomachinery operators use root-cause analysis to identify failures. It creates a plan to identify the “root” issue that ultimately caused the failure rather than the immediate problem.
Myth Busters Klaus Brun of Ebara Elliott Energy and Rainer Kurz of RKSBenergy LLC guide readers through this process: how it works, its various methodologies, and the actions operators should (and should not) take once an analysis is complete.
TURBO: What is root-cause analysis?
Brun: Root-cause analysis, some folks call it root-cause failure analysis, looks at something that failed: Your machine had a catastrophic failure, or you’re missing performance by a percent or two—something is not as you would expect with your compressor or gas or steam turbine, and you want to figure out what’s wrong. So, you perform a root-cause analysis based on certain symptoms. In most cases, this starts with a review of the machine’s operation that may have preceded the failure and in some cases a metallurgical investigation, which is one of the best indicators as to what failed. It can indicate high-cycle fatigue, sudden overload, or corrosion or erosion, for example. Then you figure out what excitation may have caused the high-cycle fatigue or what in your process gas may have caused corrosion or erosion.
At its core, every failure that a root-cause failure analysis examines is a human failure. Yes, you can claim that somebody designed it wrong or picked the wrong metal, but these are meaningless conclusions. Your conclusion needs to be actionable: This is what’s wrong, and this is what needs to be fixed to avoid this failure again. But not every analysis will provide a clear answer.
Kurz: I want to emphasize why this is called root-cause failure analysis: The key is that you don’t just look at what immediately caused the failure, but rather the underlying causes so you can avoid failures in the future. This is where it gets fuzzy because not every failure has a clear reason; it’s sometimes an effect of numerous things. A systematic root-cause failure analysis creates a plan to identify the issue that ultimately caused the failure.
Brun: That's important because very few failures can be attributed to a single cause. It’s also important to dig into it and not just say, “A blade failed, I see high-cycle fatigue,” when it could be due to bad manufacturing; maybe there was a crack in the blade caused by stress corrosion. Usually, that acts in combination with excitation mechanisms such as aerodynamic, rotor-dynamic, and blade-dynamic excitation. You must figure out what’s unusual because you need to eliminate it: What can I do to fix the problem?
As Rainer stated, unfortunately, it's not always that clear. Remember Murphy's Law: Anything that can go wrong will go wrong. So, sometimes you’re just unlucky. But you have to be careful not to over-constrain as a result of a failure, because very often the root-cause failure analysis figures something out and then you establish new rules, standards, and requirements. That can be overdone, and you may end up with too many requirements.
Kurz: Without a clear understanding of what happened and how likely the failure is to happen again, it doesn't make sense to invoke massive design changes that may impact the performance of the equipment in other areas. There’s a tendency in the industry to identify a failure, discuss it, find the root cause, and then write rules to eradicate an unlikely event at the cost of the design community.
Brun: Root-cause failure analysis is a multidisciplinary approach, so you need materials engineers, rotor-dynamicists, and aerodynamicists to identify an unknown cause. If you just use one rotor-dynamicist, you might only find a rotor-dynamics issue—if you use a materials guy, he will find something that's materials-related.
As an industry over the last 70 to 80 years, rather than trying to understand the physics of what happened, we have just made up new design rules. These design requirements and design rules make it into our industry standards. API 616 for gas turbines, for example, has grown from 30 to 40 pages in the 1960s and 1970s to 300 pages now. It's unmanageable, as nobody reads or understands it anymore; we’ve moved from standards to specifications. That has eliminated flexibility in the design process for new machines. So, when you conduct a root-cause failure analysis, you have to be careful not to over-constrain future operations and designs based on its findings.
TURBO: Can you break down the various methods used for root-cause failure analysis?
Brun: There are two ways of looking at it: You either start broad and work toward the inside, or start on the inside and work toward the outside. The first method assembles a multidisciplinary team that lists all the possible failure mechanisms and combinations of failure mechanisms; this is the fishbone approach. The other approach starts on the inside; for example, you identify the symptom as blade failure and begin with a materials analysis, which reveals high-cycle fatigue. Then, you probe the potential causes of the high-cycle fatigue, which could be aerodynamically induced or erosion-induced. Why is it aerodynamically induced? Maybe you have unstable combustion in the gas turbine or some blockage in the guide vanes.
Those are the two approaches and many charts and graphics are generated during the process. If you have a good team of people, you can find the failure either way.
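The inside-out approach Brun describes can be pictured as a cause tree that is walked from the symptom down to candidate root causes. The following is a minimal Python sketch under stated assumptions: the tree contains only the examples mentioned in the interview, and the names `CAUSE_TREE` and `why_chains` are hypothetical, not part of any standard tool.

```python
# Hypothetical cause tree for the inside-out approach. The entries come from
# the examples in the interview and are not a complete list of mechanisms.
CAUSE_TREE = {
    "blade failure": {
        "high-cycle fatigue": {
            "aerodynamic excitation": {
                "unstable combustion": {},
                "flow blockage": {},
            },
            "erosion": {},
        },
    },
}


def why_chains(tree, prefix=()):
    """Return every path from the symptom down to a leaf (candidate root cause)."""
    chains = []
    for node, children in tree.items():
        path = prefix + (node,)
        if children:
            # Keep asking "why?" until no deeper cause is listed.
            chains.extend(why_chains(children, path))
        else:
            chains.append(path)
    return chains


for chain in why_chains(CAUSE_TREE):
    # Read "<-" as "caused by".
    print(" <- ".join(chain))
```

Each printed chain is one hypothesis for the team to confirm or rule out with evidence. The outside-in approach would start from a much wider tree and prune it.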
Kurz: In a way, this is a bit like a detective story. You find something failed, aka the dead body, and then you gather evidence. One of the key things in root-cause failure analysis is that you not only gather evidence but also document it. One of the key problems when these reports are written is you have a failure in 2015, for example, and you conduct an analysis, and the next similar failure isn’t for another 10 years. (These events don’t happen that often.) And if you haven't documented what you did for the first one, you can’t use it as a starting point.
Brun: We even call it forensics. There are guidelines for machinery failure forensics to help preserve evidence: When you collect blades, don't clean dirty surfaces, and store all your pieces properly. Just as in Rainer's detective story, if you destroy the evidence, there's no way to get it back. Let's say you think there’s erosion on the blade but you cleaned the leading edge; then it won't have that evidence anymore. Once the failure happens, you need a sterile area where your team can collect the evidence.
It's the same with operational data. Ask your operators if they’ve done anything different from normal in the last days, weeks, months, or years. Look at the historical trends in your data acquisition and control systems to check whether your instrumentation recorded higher vibrations or temperatures. Historical data is valuable for spotting differences in your root-cause failure analysis, but there are cases without an answer, and you can't force it.
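One simple way to look for such differences is to compare recent readings against a baseline from a period of normal operation. The sketch below assumes a plain standard-deviation test with a threshold of three; the function name `flag_deviations`, the threshold, and the vibration values are all made up for illustration and are not a method the speakers describe.

```python
from statistics import mean, stdev


def flag_deviations(baseline, recent, threshold=3.0):
    """Return (index, value) pairs in `recent` that lie more than `threshold`
    baseline standard deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return [(i, x) for i, x in enumerate(recent) if x != mu]
    return [(i, x) for i, x in enumerate(recent) if abs(x - mu) / sigma > threshold]


# Made-up bearing vibration readings in mm/s, for illustration only.
baseline = [2.1, 2.0, 2.2, 2.1, 1.9, 2.0, 2.1, 2.2]
recent = [2.1, 2.3, 2.9, 3.4]

print(flag_deviations(baseline, recent))
```

A flagged reading is only a prompt to ask what changed at that time; it does not identify the cause by itself.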
Kurz: When a machine shuts down, what usually happens is that maintenance starts to disassemble the machine or try certain fixes. The problem is that when the people who actually do the failure analysis finally get around to it, a lot of the evidence has already been destroyed because the machine was disassembled in, usually, an expedient manner. How a company first approaches a failure is partly a matter of culture, and that first approach determines whether there is a chance of doing a reasonable failure analysis.
Brun: I recommend most operators take a class on root-cause failure analysis. The tendency, if something fails, is to jump on it, rip it apart, and fix it as quickly as possible. But that's not going to help if you have that same failure again three months later. So, you need to understand the failure cause, take more time, and assemble a team of experts in various fields. I mentioned aerodynamicists, but also someone in control who can read vibrations is important. Sometimes it takes a little longer and it becomes costly, but nothing is as costly as another failure, right?
Even if you conduct root-cause failure analysis and it costs a couple hundred thousand dollars, that's still not as costly as replacing a machine that costs several million dollars. It is also worthwhile to bring in an independent team. Some companies specialize in root-cause failure analysis, and it’s good to bring in outsiders because they’ll examine the failure from a less-biased perspective. If you have your own operators look at the failure, they may not identify a failure mode that implicates themselves.
Kurz: The machine has to be disassembled, and you often see somebody take it apart and leave bits and pieces lying around. After that, you won't be able to determine how they were arranged or aligned relative to each other. It’s a good habit to take as many photos as possible before and during disassembly. Evidence gathering is not just the job of the root-cause failure analysis team; it starts much earlier, with the first responders to the failure or problem.
Brun: The first thing should be a borescope inspection and proper documentation of each component inside the machine. Then, as you rip it apart, take as many photos as possible and write down what you photograph. Sometimes, you look at a photograph of a blade and nobody remembers where it came from, and that's not helpful. If you have blade failures in an axial compressor, you have to know what failed first in order to identify secondary blade failures. If [blades] in the second row failed, everything downstream may be chopped.
TURBO: Are there specific actions that should or should not be taken based on the results of root-cause analysis?
Brun: If the cause is clearly identifiable, then you should implement rules, restrictions, or design changes that keep it from happening again. For example, a coating started cracking, a material started eroding, or your fuel attacked the base material—all these give you clear methods to prevent failures. Unfortunately, in many cases, it's not that clear. In these cases, you must be careful because you can over-restrict yourself and “throw out the baby with the bathwater.” Some of this comes down to judgment calls and experience.
Kurz: This whole process is a discussion on how likely it is that the same failure happens again. In root-cause analysis, you can identify that Problems A, B, and C caused that failure. Then you have to ask yourself: What will my corrective action cost me? The discussion is very different whether you, for example, conduct root-cause failure analysis on a nuclear reactor or a pump that you bought for $30,000.
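The trade-off Kurz describes can be written as a rough expected-cost comparison: a corrective action is worth considering when it costs less than the expected loss from a repeat failure. The sketch below is a simplification under stated assumptions; only the $30,000 pump comes from the interview, while the function names, the recurrence probability, and the other dollar figures are hypothetical. It also ignores safety consequences, which a cost figure alone does not capture.

```python
def expected_failure_cost(p_recurrence, failure_cost):
    """Expected loss from a repeat failure over the period considered."""
    return p_recurrence * failure_cost


def action_is_justified(p_recurrence, failure_cost, action_cost):
    """True when the corrective action costs less than the expected loss it avoids.

    Assumes the action fully removes the failure mode, which is a simplification.
    """
    return action_cost < expected_failure_cost(p_recurrence, failure_cost)


# The $30,000 pump is from the interview; every other number is made up.
print(action_is_justified(0.05, 30_000, 10_000))       # small pump
print(action_is_justified(0.05, 5_000_000, 200_000))   # multi-million-dollar machine
```

With the same assumed recurrence probability, the answer flips between the two cases, which is why the discussion differs so much between a small pump and a critical machine.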
Brun: There is a severity index: If the failure is catastrophic and puts people's lives and facilities at risk, or there's potential for fire or toxic gas release, that’s a relevant, high-impact factor. You may also lose 3 to 4% efficiency on the compressor and sometimes still be able to continue running. Unfortunately, losing that 3% may be an indication that something is starting to go wrong and could end up being catastrophic. You must understand your machine rather than act blindly. These are judgment calls based on real experience and the machine’s physics.
Listen to the full podcast here.