This article is a continuation of my previous article on how to do root cause analysis. As promised, it provides examples of root cause analysis being performed.
A famous example of root cause analysis is the presidential commission's inquiry into the 1986 US Challenger space shuttle explosion, particularly the observations of Nobel prize-winning physicist Richard Feynman. The basic finding of this investigation was that the explosion was caused by the failure of the O-ring seal in the right solid rocket motor, due in part to unusually cold temperatures. The commission identified the problem, but did it find the root cause? The report does have a short section on the contributing cause of the accident.
Contrast this with Feynman's observations, which treat the O-ring seal as just one symptom and focus instead on deeper issues such as how NASA management evaluated shuttle reliability and safety. It was the solid rocket booster that failed, yet Feynman also investigated other major shuttle components. Feynman probed to the heart of the matter - the root cause - by not accepting limits on his why questions. He found that the engineers' warnings that it was too cold to safely launch went unheeded by management. His investigation was not without political consequences. Feynman's observations almost didn't make it into the final inquiry report - he had to fight to have them included - and they were relegated to an appendix. Feynman's final statement in his report elegantly summed up the root cause: "For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."
My other example of root cause analysis comes from my own experience on a maintenance team for an operational business application. End users of this application had discovered bad data in one of the database tables in the production system. Other people on the team looked into the problem and determined that it was caused by a missing database trigger. Not missing in the sense that it had never been added, but missing in the sense that the trigger had once existed and no longer did. When I learned of this situation, I started my root cause analysis by asking why the trigger had disappeared. Naturally I didn't get an answer, unless you count "I don't know". It was time to start investigating.
Triggers don't disappear by themselves. Someone had made a change to the database schema that eliminated the trigger. I doubted that someone would have explicitly deleted the trigger, if only because everyone was surprised it was gone. So it was a mistake - some other change that inadvertently eliminated the trigger. My chief suspect was a change to the underlying table. If the table was dropped and recreated for some modification, then dependent objects such as triggers would have been automatically dropped and would have needed to be recreated as part of the change. Of course, the database administrators (DBAs) on the team who make all the database changes know this. So why then would the trigger not be recreated?
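To make the suspected mechanism concrete, here is a minimal sketch of a drop-and-recreate table change silently taking a trigger with it. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in, since the actual database product is not named here; the table and trigger names (orders, audit_log, trg_orders_audit) are invented for illustration.

```python
import sqlite3


def trigger_names(conn, table):
    """Return the names of all triggers currently attached to `table`."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'trigger' AND tbl_name = ? ORDER BY name",
        (table,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")

# Hypothetical schema: a table plus a trigger that copies inserts to a log.
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE audit_log (order_id INTEGER, amount REAL);
    CREATE TRIGGER trg_orders_audit AFTER INSERT ON orders
    BEGIN
        INSERT INTO audit_log VALUES (NEW.id, NEW.amount);
    END;
""")
before = trigger_names(conn, "orders")

# The "table change": drop and recreate with an extra column.
# Nothing here recreates the trigger, and no error is raised.
conn.executescript("""
    DROP TABLE orders;
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, currency TEXT);
""")
after = trigger_names(conn, "orders")

print("before:", before)
print("after: ", after)
```

The change script succeeds without complaint, which is exactly why this kind of mistake can go unnoticed until bad data shows up.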
I needed to find out how the DBAs normally performed table changes. A few questions later, I learned that the typical approach was to use their DBA tool to extract the DDL definition for the table and all related database objects (views, indexes, etc.), make the required changes to this DDL, then run it. I then tried this procedure for myself, selecting a table with a trigger. I used the DBA tool to extract the DDL. To my surprise, the resulting DDL did not include the trigger definition. This meant that I had found the probable root cause. While I didn't definitively know that this caused the problem, I knew it was very likely. More importantly, it was an issue that could be addressed to minimize the likelihood of triggers being dropped in the future. I notified the DBAs about my findings and this defect in the DBA tool was submitted to the company that developed it.
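The manual experiment above could also be automated. The sketch below, again using SQLite through Python's sqlite3 module as a stand-in, compares the dependent objects recorded in the database catalog against the text of an extracted DDL script and reports any that are absent. The function incomplete_extract is a hypothetical imitation of a tool with the defect described, not the real DBA tool, and all object names are invented.

```python
import sqlite3


def dependent_objects(conn, table):
    """Return the names of the indexes and triggers attached to `table`."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE tbl_name = ? AND type IN ('index', 'trigger') "
        "AND sql IS NOT NULL ORDER BY name",
        (table,),
    )
    return [row[0] for row in rows]


def missing_from_ddl(conn, table, ddl_text):
    """List dependent objects whose names never appear in the DDL text.

    A plain substring match is crude, but enough to flag an omission.
    """
    lowered = ddl_text.lower()
    return [n for n in dependent_objects(conn, table) if n.lower() not in lowered]


def incomplete_extract(conn, table):
    """Hypothetical stand-in for the defective tool: table and indexes only."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE tbl_name = ? AND type IN ('table', 'index') AND sql IS NOT NULL",
        (table,),
    )
    return ";\n".join(row[0] for row in rows) + ";"


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE audit_log (order_id INTEGER, amount REAL);
    CREATE INDEX idx_orders_amount ON orders (amount);
    CREATE TRIGGER trg_orders_audit AFTER INSERT ON orders
    BEGIN
        INSERT INTO audit_log VALUES (NEW.id, NEW.amount);
    END;
""")

ddl = incomplete_extract(conn, "orders")
missing = missing_from_ddl(conn, "orders", ddl)
print("objects missing from extracted DDL:", missing)
```

Run before a change script is executed, a check of this kind would turn a silent omission into a visible warning. On another database product the catalog query would differ, so treat the SQL here as an assumption.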
While I had a likely root cause, this didn't mean I was done with the root cause analysis. There were still more why questions to ask. Why wasn't this problem noticed by the maintenance team before the change went into the production system? Was a proper review performed of the change? Why didn't system testing or user testing detect the missing trigger? Were any other triggers missing on production database tables? For each of these questions there is the potential for an answer that will identify how to help prevent this type of problem from recurring.
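The last of those questions lends itself to a simple audit. As a rough sketch, assuming a baseline list of expected triggers is available from somewhere trustworthy (design documents or version control, say), the triggers that actually exist can be compared against it. SQLite via Python's sqlite3 module stands in for the real database, and the baseline and object names below are made up for the example.

```python
import sqlite3


def missing_triggers(conn, expected):
    """Return sorted (table, trigger) pairs that are expected but absent.

    `expected` maps each table name to the set of trigger names it should have.
    """
    actual = set(
        conn.execute(
            "SELECT tbl_name, name FROM sqlite_master WHERE type = 'trigger'"
        )
    )
    wanted = {(table, name) for table, names in expected.items() for name in names}
    return sorted(wanted - actual)


conn = sqlite3.connect(":memory:")

# Hypothetical production state: customers has lost its trigger.
conn.executescript("""
    CREATE TABLE audit_log (source TEXT, row_id INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TRIGGER trg_orders_audit AFTER INSERT ON orders
    BEGIN
        INSERT INTO audit_log VALUES ('orders', NEW.id);
    END;
""")

# Hypothetical baseline of what should exist.
expected = {
    "orders": {"trg_orders_audit"},
    "customers": {"trg_customers_audit"},
}

gaps = missing_triggers(conn, expected)
print("missing triggers:", gaps)
```

An audit like this only answers whether something is missing today. It is no substitute for asking why review and testing did not catch the problem in the first place.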