
Troubleshooting Incidents and Blackboard Architectures

I recently helped troubleshoot a complex, multi-day incident that seemed to be a never-ending stream of surprises. Communication among the teams involved and their management was a challenge, with different theories being discussed in separate threads that had only partially overlapping sets of participants.

After the dust settled, I took the time to reflect on these communication challenges and my approach to troubleshooting, and I came to a key realization: a blackboard architecture is the perfect model for troubleshooting incidents. From Wikipedia's definition:

...a common knowledge base, the "blackboard", is iteratively updated by a diverse group of specialist knowledge sources, starting with a problem specification and ending with a solution. Each knowledge source updates the blackboard with a partial solution when its internal constraints match the blackboard state. In this way, the specialists work together to solve the problem.
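To make the definition concrete, here is a minimal Python sketch of the idea. It is only an illustration under my own assumptions: the names `KnowledgeSource`, `run_blackboard`, and the three example specialists are hypothetical, and the blackboard is reduced to a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption for this sketch: the blackboard is just a dict of named findings.
Blackboard = Dict[str, str]


@dataclass
class KnowledgeSource:
    """A specialist: a precondition on the board plus an update to apply."""
    name: str
    can_contribute: Callable[[Blackboard], bool]
    contribute: Callable[[Blackboard], None]


def run_blackboard(board: Blackboard, sources: List[KnowledgeSource],
                   max_rounds: int = 10) -> List[str]:
    """Poll the sources until a solution is posted or nobody can contribute."""
    order: List[str] = []
    for _ in range(max_rounds):
        progressed = False
        for source in sources:
            if "solution" in board:
                return order
            if source.can_contribute(board):
                source.contribute(board)
                order.append(source.name)
                progressed = True
        if not progressed:
            break
    return order


# Hypothetical specialists, deliberately listed in the "wrong" order.
sources = [
    KnowledgeSource(
        "diagnostician",
        lambda b: "memory_trend" in b and "recent_change" in b,
        lambda b: b.update(solution="suspect a leak introduced by the recent change"),
    ),
    KnowledgeSource(
        "change_reviewer",
        lambda b: "memory_trend" in b and "recent_change" not in b,
        lambda b: b.update(recent_change="cache size raised last week"),
    ),
    KnowledgeSource(
        "metrics_analyst",
        lambda b: "symptom" in b and "memory_trend" not in b,
        lambda b: b.update(memory_trend="heap grows until crash"),
    ),
]

board = {"symptom": "server crashing"}  # the problem specification
order = run_blackboard(board, sources)
print(order)  # ['metrics_analyst', 'change_reviewer', 'diagnostician']
```

Note that the order of contributions is driven by the state of the blackboard, not by the order in which the specialists are listed. Each one acts only once the findings it depends on have been posted.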

A blackboard architecture does not have to be realized within an IT system. Many detective or investigative plots in shows feature a wall on which analysts post pictures of possible suspects, locations of events, pieces of evidence, and so on, and then try to make sense of it all. Troubleshooting incidents is essentially detective work, so it seems obvious in retrospect that a blackboard architecture would fit.

[Image: Detective Wall]

My personal approach to troubleshooting fits this model. I start by studying the direct symptoms of the incident (e.g., a server crashing), and then branch out to gather related observations (e.g., performance metrics, recent changes, trends). I usually identify at least a few candidate theories about the cause early on. I evaluate each new observation to see whether it supports or contradicts an existing theory, or suggests a new one. Observations are not always correct, either, so in fuzzy-logic style I place more weight on observations that correlate with others and discount ones that contradict them. From the current set of candidate theories and observations, I brainstorm experiments or additional observations that would help confirm or eliminate specific theories.
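The weighting step can be sketched in a few lines of Python. This is a rough illustration rather than a formal fuzzy-logic system: the `Observation` class, the `agreement`, `weight`, and `score_theories` helpers, the 0.5 step, the 0.25 floor, and the sample observations are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class Observation:
    text: str
    supports: FrozenSet[str] = frozenset()     # theories this observation backs
    contradicts: FrozenSet[str] = frozenset()  # theories it argues against


def agreement(a: Observation, b: Observation) -> int:
    """+1 if a and b point the same way on a theory, -1 if they conflict, else 0."""
    same = (a.supports & b.supports) or (a.contradicts & b.contradicts)
    opposed = (a.supports & b.contradicts) or (a.contradicts & b.supports)
    if same and not opposed:
        return 1
    if opposed and not same:
        return -1
    return 0


def weight(obs: Observation, all_obs: List[Observation]) -> float:
    """Corroborated observations count for more, contested ones for less."""
    net = sum(agreement(obs, other) for other in all_obs if other is not obs)
    return max(0.25, 1.0 + 0.5 * net)  # arbitrary step and floor for the sketch


def score_theories(theories: List[str],
                   observations: List[Observation]) -> Dict[str, float]:
    """Add each observation's weight to theories it supports, subtract otherwise."""
    scores = {theory: 0.0 for theory in theories}
    for obs in observations:
        w = weight(obs, observations)
        for theory in obs.supports:
            scores[theory] += w
        for theory in obs.contradicts:
            scores[theory] -= w
    return scores


observations = [
    Observation("heap usage climbs until crash", supports=frozenset({"memory leak"})),
    Observation("OOM killer entries in kernel log", supports=frozenset({"memory leak"})),
    Observation("disk is at 40% capacity", contradicts=frozenset({"disk full"})),
    Observation("someone recalls a disk alert", supports=frozenset({"disk full"})),
]

scores = score_theories(["memory leak", "disk full"], observations)
leading = max(scores, key=scores.get)
print(leading, scores)
```

In this made-up data the two memory observations corroborate each other and gain weight, while the two disk observations conflict and are both discounted, so "memory leak" comes out ahead. A conflict like that is also a natural prompt for the next experiment: check the disk alert history to settle which observation is wrong.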

Applying this model to troubleshooting a complex incident involving multiple teams highlights the need for a better communication model than threads of discussion (whether implemented via email, chat, or forum). I believe it would be helpful to have an electronic version of a blackboard that allows people to post theories and observations, with multimedia support for artifacts such as tables of statistics, charts, diagrams, and links. I am not aware of any digital tools that explicitly support this model; Google Docs is the closest I can think of. In the physical world, war rooms and operation centres with large whiteboards can serve a similar purpose.

