
Fail Fast or Degrade Gracefully?

There are two approaches to handling internal application errors. In the fail fast approach you immediately terminate the operation (or even the application) once an error is detected. In the degrade gracefully approach you try to continue with as much of the operation as you can.

For quite a while I have been a firm proponent of the fail fast approach. If you encounter an internal application error (e.g. a method parameter is unexpectedly null), this is often a sign of a defect. The presence of a defect means you can no longer trust the operation of the application, so the safest approach is to terminate the operation or even the application. (In Java, this is typically done by throwing an appropriate RuntimeException.) Besides being the safer of the two approaches, fail fast forces the problem into the open, which makes it more likely to be detected and fixed.
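As a minimal sketch of a fail fast check in Java, the hypothetical class below (the names `Account`, `owner`, and `greeting` are illustrative assumptions, not taken from any real application) rejects a null argument at the point of entry. `Objects.requireNonNull` throws a `NullPointerException`, which is a `RuntimeException`.

```java
import java.util.Objects;

// Hypothetical class, used only to illustrate a fail fast precondition check.
final class Account {
    private final String owner;

    Account(String owner) {
        // Fail fast: a null owner signals a defect in the caller, so stop
        // here instead of letting the bad value travel further into the code.
        this.owner = Objects.requireNonNull(owner, "owner must not be null");
    }

    String greeting() {
        return "Hello, " + owner;
    }
}
```

With this check in place, the exception's stack trace points at the caller that passed null, not at some later line that happened to dereference it.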

However, I recently came across a situation in which the degrade gracefully approach made more sense. The application in question had a generic message class for formatting messages with parametrized arguments. To use the class, you provide the message embedded with tokens representing one or more parameters, plus the parameters to be substituted for the tokens. One day I came across a use of this message class that supplied a parametrized message with the wrong number of parameters. Curious as to why this block of code had not 'died' (thrown an exception) during testing, I looked into the implementation of this message class. I discovered that the class did absolutely no checking of the arguments supplied to it. As a result, you could supply the wrong number of parameters (too many or too few), and the class would still return the formatted string, ignoring extra parameters and treating missing parameters as empty strings. A little investigation quickly revealed that there were other places in the application that were supplying the wrong number of parameters to this class.

So I refactored the message class to use the fail fast approach, then searched for usages of the class to fix the cases where the arguments were invalid. It didn't take long before the changes were done and all the unit tests passed, so I committed my code. Some time later someone encountered an error that I quickly recognized: an invalid argument supplied to that generic message class. Obviously, I had missed a place in the application that was calling the message class incorrectly. But the error got me thinking: the message class was used to format a message to be displayed to the user. Before, with the degrade gracefully approach, the users had been able to perform the operation in question successfully, despite getting a poorly constructed message. Now, with the fail fast approach, we did quickly find out about the bad message, but the user could no longer complete the work they were trying to do. I wasn't happy that my change had made the application less useful for the user.
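To make the fail fast version concrete, here is a rough sketch of what such a message class might look like. The original class is not shown in this article, so everything here is an assumption: the name `StrictMessageFormatter`, the `{0}`, `{1}` token syntax, and the rule that the expected parameter count is the highest token index plus one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical fail fast formatter; the {n} token syntax is an assumption.
final class StrictMessageFormatter {
    private static final Pattern TOKEN = Pattern.compile("\\{(\\d+)\\}");

    static String format(String template, String... params) {
        // The template expects (highest token index + 1) parameters.
        int expected = 0;
        Matcher counter = TOKEN.matcher(template);
        while (counter.find()) {
            expected = Math.max(expected, Integer.parseInt(counter.group(1)) + 1);
        }
        // Fail fast: any mismatch is treated as a defect in the calling code.
        if (expected != params.length) {
            throw new IllegalArgumentException("Message \"" + template + "\" expects "
                    + expected + " parameter(s) but received " + params.length);
        }
        // Substitute each token with its parameter.
        Matcher m = TOKEN.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = params[Integer.parseInt(m.group(1))];
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Under these assumptions, a call with too few or too many parameters throws before any text is produced, which is exactly the behavior that stopped the user's operation in the story above.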

After some thought, I realized that the degrade gracefully approach was appropriate in this situation. A message to the user missing some parameters is almost always still somewhat understandable, and has nothing to do with the actual business logic being performed, so it is fairly safe to continue with constructing the message despite receiving the incorrect number of parameters. But I still wanted to be able to find out about these cases - they did represent defects (albeit minor) in the code. I really wanted the advantages from both approaches.

To achieve this, I again refactored the message class to allow it to proceed despite having the wrong number of parameters. I changed the code checking for invalid parameters to log an error to the application log instead of throwing an exception. By logging an error I ensured that the developers would find out about the problem, but the application would proceed. (You may be thinking that this error in the log file is likely to be overlooked by developers, but I had already implemented changes to ensure this wouldn't happen. I'll save the details of this for a future article.)
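The combined approach might look roughly like the sketch below. It is not the actual class from the application: the name `LenientMessageFormatter`, the `{0}`, `{1}` token syntax, and the use of `java.util.logging` are all assumptions made for illustration. Missing parameters become empty strings, extra ones are ignored, and any mismatch is logged as an error instead of being thrown.

```java
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical formatter that degrades gracefully but still reports defects.
final class LenientMessageFormatter {
    private static final Logger LOG =
            Logger.getLogger(LenientMessageFormatter.class.getName());
    // Assumed token syntax: {0}, {1}, ...
    private static final Pattern TOKEN = Pattern.compile("\\{(\\d+)\\}");

    static String format(String template, String... params) {
        int expected = 0;
        Matcher m = TOKEN.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            expected = Math.max(expected, index + 1);
            // Degrade gracefully: a missing (or null) parameter becomes "".
            String value = (index < params.length && params[index] != null)
                    ? params[index] : "";
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        // Reveal the defect in the log without interrupting the user.
        if (expected != params.length) {
            LOG.severe("Message \"" + template + "\" expects " + expected
                    + " parameter(s) but received " + params.length);
        }
        return out.toString();
    }
}
```

The key design choice is that the check itself is kept; only its consequence changes from an exception to a log entry, so the defect stays visible to developers.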

In most cases, I still prefer the fail fast approach. Even in this case involving the message class, if the original developers had used the fail fast approach then I suspect there would have been far fewer cases of calling code supplying the wrong number of parameters. This is a potential drawback of the degrade gracefully approach: if you are not careful, you end up hiding information about a defect. If you do decide to use the degrade gracefully approach, ensure you have a mechanism to detect and reveal any defects, rather than completely hiding them.

One case where the degrade gracefully approach is often used is at the application architecture level. Applications such as web servers and business web applications that process multiple independent operations do not terminate upon encountering an internal error. Instead, the current operation is terminated with the appropriate error reported while the application continues running, able to process other requests.
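The architecture-level pattern can be sketched in a few lines. This is a simplified illustration, not how any particular web server is implemented; `RequestLoop` and `processAll` are hypothetical names, and each "request" is modeled as a plain `Supplier<String>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical request loop: each operation may fail fast on its own,
// but one failure does not stop the operations that follow it.
final class RequestLoop {
    static List<String> processAll(List<Supplier<String>> operations) {
        List<String> results = new ArrayList<>();
        for (Supplier<String> operation : operations) {
            try {
                results.add("OK: " + operation.get());
            } catch (RuntimeException e) {
                // Terminate only the current operation and report its error;
                // a real application would also log the exception here.
                results.add("ERROR: " + e.getMessage());
            }
        }
        return results;
    }
}
```

Each operation still fails fast internally; the graceful degradation happens one level up, where the failure is contained to the single request that triggered it.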


One Comment on “Fail Fast or Degrade Gracefully?”

  1. I try to do something like what Michael Feathers describes: a fail-fast core that can be wrapped for higher level functions. Fail fast seems like the right thing to start with since it catches bugs before your code is mature and, let’s face it, is much easier to write.

    When I do add error handling, it’s normally via Python decorators as described on my blog.

    Another solution is a catch-all error handler that logs the error (or emails, etc.) for developer knowledge and gives the user an “error occurred: watch out!” message. The handler would have to be robust (so you don’t hide errors) and flexible (since you’ll probably want to customize the error message depending on the failure, etc.). It might also end up fairly specific to the project since you don’t want to write to stdout in a web-app or try to send email from your embedded microcontroller. As such it doesn’t seem worth it for small apps.

    Of course writing the handler to be both robust–you don’t want to miss errors after all–and flexible
