
Error Handling and Reliability

I have been thinking a lot lately about how to create reliable systems. I previously examined the link between complexity and reliability. Recently, however, I have come to appreciate the impact of error handling on reliability. For the purposes of this discussion, I consider two aspects of reliability: correctness (whether the application produces the correct results) and uptime (how long the software can operate without terminating due to an error). A single defect or environmental problem can impact one or both of these measures. For example, a defect in an algorithm can cause a program to calculate the wrong result without impacting uptime. A memory leak or network outage can impact uptime without impacting correctness. A null pointer exception can impact both. The error handling strategy you choose for your system affects both correctness and uptime. I am familiar with three main approaches to handling errors:

  • Ignore errors
  • Fail fast
  • Degrade gracefully

The ignore errors approach is very simple: assume errors will not happen and ignore them. Some of you may object that this is not a 'real' error handling strategy, but considering how often I have seen it used in production systems, I cannot agree. This approach does have the benefit of maximizing uptime: even if things go wrong, the program will keep running. Of course, if the program is producing incorrect output due to these errors, then you have a problem. So this approach tends to minimize correctness. Any problems that do occur are what I call silent failures: they go undetected, at least for a while. Unix shell scripts and the C programming language adopt this strategy as the default. Utilities and functions report errors through return codes, so a call that results in an error will not affect the operation of your program or script unless you explicitly check its return code.
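The silent failure described above can be sketched in a few lines of Python. This is only an illustration: `parse_reading` and `average_ignoring_errors` are hypothetical names, and the return-code convention (0 for success, -1 for failure) is an assumption borrowed from C rather than anything Python requires.

```python
def parse_reading(text, out):
    """Append the parsed number to `out`; report failure via a C-style return code."""
    try:
        out.append(float(text))
        return 0   # success
    except ValueError:
        return -1  # failure, visible only if the caller checks


def average_ignoring_errors(raw_readings):
    values = []
    for text in raw_readings:
        parse_reading(text, values)  # return code ignored: errors pass silently
    return sum(values) / len(values) if values else 0.0


# The bad reading is dropped without a trace, so the result looks plausible.
print(average_ignoring_errors(["10", "oops", "20"]))  # prints 15.0
```

The program keeps running and prints a number, but nothing tells the caller that a third of the input was discarded.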

The fail fast approach is also very simple: whenever an error or unexpected event happens, immediately terminate execution. This approach tends to maximize correctness, but tends to minimize uptime, since any abnormality causes the program to end. These applications tend to be brittle. The slightest problem in the environment, such as a blip in the network, can bring down the application. Modern enterprise programming languages such as Java and C# adopt this strategy through the use of exceptions. If a problem occurs, an exception is thrown, which will terminate the program unless it is explicitly caught and dealt with.
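A minimal sketch of fail fast behavior, written in Python, whose exceptions behave like those of Java and C# in this respect. The function name `average_fail_fast` is hypothetical; the only library behavior relied on is that the built-in `float()` raises `ValueError` for a string it cannot parse.

```python
def average_fail_fast(raw_readings):
    # float() raises ValueError on the first malformed reading; nothing
    # catches it here, so the exception propagates to the caller.
    values = [float(text) for text in raw_readings]
    return sum(values) / len(values)


print(average_fail_fast(["10", "20"]))  # prints 15.0

try:
    average_fail_fast(["10", "oops", "20"])
except ValueError as exc:
    # Caught only to keep this demonstration running; a fail fast program
    # would normally let the exception terminate it.
    print("terminated by:", exc)
```

No wrong answer is ever produced, but a single malformed reading is enough to stop the whole computation.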

The degrade gracefully approach combines the best of the other two approaches. It detects errors like the fail fast approach, but instead of failing immediately, it handles the error and continues execution as appropriate. It therefore maximizes the reliability of the system by maximizing both correctness and uptime. The downside of this approach is that it requires much more thought and effort to implement. No programming language I am aware of provides explicit support for this approach.
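One possible shape for the degrade gracefully approach, sketched in Python under stated assumptions: the `AverageResult` type and the `average_degrading` function are hypothetical, and the policy chosen here (skip a bad reading, record it, and report it alongside the result) is just one of many reasonable ways to continue execution "as appropriate".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AverageResult:
    value: Optional[float]                             # None when no reading was usable
    skipped: List[str] = field(default_factory=list)   # inputs that failed to parse


def average_degrading(raw_readings):
    values, skipped = [], []
    for text in raw_readings:
        try:
            values.append(float(text))
        except ValueError:
            skipped.append(text)  # error detected and recorded, not ignored
    value = sum(values) / len(values) if values else None
    return AverageResult(value, skipped)


result = average_degrading(["10", "oops", "20"])
print(result.value)    # prints 15.0
print(result.skipped)  # prints ['oops']
```

The extra effort is visible even in this small case: the return type has to carry the error report, and every caller has to decide what to do with it.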

I was originally a strong proponent of the fail fast approach, but last year I started to appreciate the degrade gracefully approach, as I wrote in my article 'Fail Fast or Degrade Gracefully?' Over the past year, my viewpoint has shifted further. I now feel that the degrade gracefully approach should be used by default. Only if it would require too much effort or complexity to implement should the fail fast approach be used instead. (Naturally I do not support the use of the ignore errors approach.)

There are many examples of the degrade gracefully approach within the IT infrastructure we rely on. TCP/IP networking stacks are designed to degrade gracefully when problems such as dropped packets occur. Web servers do not shut down if a web application experiences a failure - they instead terminate the current request by sending an error response to the client and continue to serve other requests. Email clients do not fail if the email server becomes unavailable, and more importantly the mail you were trying to send is not lost. Modern compilers do not stop upon encountering the first syntax error but instead continue parsing the same file (and other files) as best they can.
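The web server behavior mentioned above, where a failure ends one request but not the server, can be sketched as a loop that isolates each request. This is a toy model rather than a description of how any real server is implemented: `serve` and `divide_handler` are hypothetical names, and the 200 and 500 values simply mimic HTTP status codes.

```python
def serve(requests, handler):
    """Process each request independently; one failure must not stop the loop."""
    responses = []
    for request in requests:
        try:
            responses.append((200, handler(request)))
        except Exception as exc:
            # Isolate the failure: answer this request with an error
            # response, then carry on with the next one.
            responses.append((500, f"internal error: {exc}"))
    return responses


def divide_handler(request):
    numerator, denominator = request
    return numerator / denominator  # raises ZeroDivisionError when denominator is 0


responses = serve([(4, 2), (1, 0), (9, 3)], divide_handler)
print([status for status, _ in responses])  # prints [200, 500, 200]
```

The second request fails, yet the third is still served, which is the property that distinguishes this from a fail fast design.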

The validity of these examples could be debated. One could argue that some of these situations such as dropped network packets and bad user input (syntax errors in code) are expected - a normal part of operation - rather than representing an exceptional situation or error. The systems handle these situations because it is a requirement, not because they are using the degrade gracefully error handling approach. I would instead argue that the requirement is to use the degrade gracefully approach to handle these problematic situations, primarily because both the ignore errors approach and the fail fast approach are unacceptable.

Reliable systems do not happen by accident, but require careful thought and effort to create. The approach you choose for handling errors can have a bigger impact on reliability than you might expect.

