If you are a software developer and have never maintained production applications with real users hammering away at them, then you are missing some important lessons. You might not fully appreciate the operational challenges facing the maintenance and support team, particularly when the software in question suffers from poor reliability, performance, or capacity. During my involvement with application maintenance, I have been amazed at the number of incidents and problems that arise when an application goes into production. That is why in the last few months I have written several articles about reliability, such as Error Handling and Reliability.
Based on my experience, I thought I had a good appreciation for what can go wrong. That changed recently when I experienced a day filled with too many problems and errors to be believed. The day started off innocently enough, until a member of the production support team came by to tell us that he had accidentally terminated the database connection of one of our batch jobs. He had transposed two digits in an identifier, and by fluke the result matched our job instead of the one he wanted.
Okay, no problem: we simply needed to confirm that nothing was corrupted and restart the process. To verify which job had been affected, we checked our inboxes for the notification email that is sent whenever a batch job terminates abnormally. No such email was found. A little puzzled, we checked the server and confirmed that the process was no longer running. But another batch process was executing, and we identified it as a subsequent batch job dependent on the first. Subsequent jobs only run if their predecessors execute successfully, so we had a sinking feeling as we started checking the log files.
Sure enough, due to a complete lack of error handling, the first job had reported a successful execution despite the database connection failure, and that had caused the second job to start. That also explained the lack of a notification email. The second job depended on the processing results of the first, so its output was suspect and likely wrong. We had to kill the second job. If the first job had simply failed, we could have restarted it without a problem, but now we had to investigate how to undo the effects of both jobs and then manually restart the first.
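To make that failure mode concrete, here is a minimal Python sketch of the behaviour we expected from the job chain. It is an illustration under assumptions, not our actual batch system: the names `run_job`, `run_chain`, and `notify` are hypothetical, and the dropped database connection is simulated by raising `ConnectionError`. The point is only that a job must turn every failure into a non-zero result and a notification, and that a successor must not start unless its predecessor really succeeded.

```python
def run_job(name, step, notify):
    """Run one batch step and return a process-style exit code (0 = success).

    Any exception, including a dropped database connection, is reported
    through `notify` and turned into a non-zero code, so the job cannot
    claim success after a failure.
    """
    try:
        step()
    except Exception as exc:  # deliberately broad: no failure may pass silently
        notify(f"Batch job {name!r} terminated abnormally: {exc}")
        return 1
    return 0


def run_chain(jobs, notify):
    """Run (name, step) pairs in order, stopping at the first failure.

    Returns the names of the jobs that completed successfully.
    """
    completed = []
    for name, step in jobs:
        if run_job(name, step, notify) != 0:
            break  # successors depend on this job's output, so do not start them
        completed.append(name)
    return completed


def first_job():
    # Simulates the connection being terminated in the middle of the run.
    raise ConnectionError("database connection terminated")


def second_job():
    print("second job ran")


messages = []  # stands in for the notification email
completed = run_chain(
    [("first", first_job), ("second", second_job)],
    notify=messages.append,
)
print(completed)  # []
print(messages)   # one message naming the first job
```

With this structure, the killed connection would have produced a notification and left the second job unstarted, which is the situation where a simple restart of the first job is enough.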
Well, that didn't seem too bad, until I had time to think for a second. That is when I realized that our batch jobs are always scheduled at night or on weekends, never during business hours. What was this one doing running during the day? That prompted another investigation, which revealed that the job does normally run on weekends. But the previous weekend, problems with this job's predecessors had delayed it until one of the nights during the week. So why didn't it run at night? We were surprised to discover that it had; it just didn't finish. Due to performance issues, the job had run for over eight hours, extending into the day, before it was killed by mistake.
By late afternoon of that day we had multiple investigations underway, trying to track down the root causes of the problems we had identified. My mind reached its saturation point sometime in the afternoon, so I cannot remember the details of what was found. I suspect other problems were unearthed that I have since forgotten. Nor were we able to get everything fixed that same day. That combination of problems, coming together on that one day, kept a surprising number of people busy for days sorting out the mess.