Citizens of the province of Alberta, Canada experienced a rare event this summer when a major data center in the City of Calgary suffered a multi-day outage due to a fire. Many I.T. services provided by the Government of Alberta and by taxpayer-funded organizations such as Alberta Health Services were hosted in this data center and were therefore disrupted as well.
I have written previously about the importance of good communication in avoiding outrage over outages. In that article I decomposed outage communication into two pieces: first, reporting on the incident, and second, reporting on the underlying problems or root causes. Today I want to focus on the latter category, problem communication, and highlight major deficiencies in the relevant news releases published by the Alberta Government concerning this event.
To start, let us look at an example of excellent problem communication - Amazon's report concerning their Eastern U.S. data center outage, which occurred only a few weeks prior to the Alberta outage. The first thing to notice about this communication is that it is long - approximately 2700 words. The report starts by describing in detail the events leading up to the outage. Next, Amazon provides their analysis of the faulty hardware and the steps they are taking in both the short and long term to address the issues they experienced with it. Amazon then lists the various services provided by this data center and summarizes the impact to customers. This is a preamble to the rest of the report, which goes through each service one by one and describes the timeline and events leading to its restoration. As part of this detailed account, Amazon describes various problems they encountered, such as software bugs or design flaws, and reports on what they are planning to do to correct or improve the situation. In aggregate across these service sections, Amazon provides details on two software bugs, two design flaws, and two other opportunities for improvement, and indicates they have work underway to address all of these issues. Amazon then concludes with an apology for the disruption and a commitment to learn from and make improvements based on this outage.
Given this level of detail, one might expect that Amazon needed a lot of time to prepare this report. Surprisingly, the timeline reported by Amazon indicates otherwise. The data center outage occurred on a Friday night at 8:04pm. Restoration of services continued into Saturday morning. The staff working on the various issues likely worked all Friday night, and either would have been exhausted from working an extended day or would have been coming off an eight-hour shift if they had started just before the outage. Plus it was the weekend, which I presume would lead to lower than normal staffing levels.
Yet despite this, Amazon was able to come out with this detailed problem communication on Monday, immediately after this weekend event. To me, this is one of the most notable aspects because it highlights how quickly Amazon was able to do this thorough analysis involving multiple specialties and technologies (hardware plus each of the four services), pull it all together, and publish it only two days later despite the weekend. This says a lot about how seriously they took this situation.
Now let us examine what was published by the Alberta government for problem communication regarding the Calgary data center outage, and we will see the stark contrast. The government published four news releases regarding the outage: July 12, July 13, July 15, and July 17. This last July 17 news release reported that all services were "now fully restored", so one would expect to see problem communication provided within this release. And in fact this is the case. Dissecting this news release paragraph by paragraph reveals the following:
- The first paragraph states that the remaining services have been restored. No issues here.
- The second paragraph contains an acknowledgement of impact to Albertans, using phrases like "... a frustrating inconvenience for many Albertans", and "I appreciate their patience and understanding...". This should have been an apology. In fact, a review of all four news releases shows that the government never once apologized for the disruption. In contrast, here is the first sentence of Amazon's apology: "We apologize for the inconvenience and trouble this caused for affected customers." Reading the Amazon apology paragraph, I am left with the impression that they truly do care about their customers, take full responsibility for what happened, and are committed to improving. The impression I get from the government statement is quite different: they avoid accepting any responsibility.
- The third paragraph discusses the post-incident ramifications regarding temporary provisions the government put in place. This is excellent - no issues.
- The fourth and final paragraph dashes our hopes for further details, as it is only three sentences long. The first two sentences describe the initial incident and timeline to restoration, but provide absolutely no root cause analysis or indications of areas to improve, unlike the Amazon report. The government's last sentence talks about learning and improving from the event, which I was happy to see. However, a closer look at the language used compared to Amazon's reveals more disappointing contrasts. Here is a key sentence from Amazon: "We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes." And here are the government statements (the first sentence is from the second paragraph, the second from the last paragraph): "Now we are focusing on what we can learn from this situation to improve our systems going forward.... Now that all the systems are restored, government will take the time to internally assess what happened and make improvements if necessary." Amazon's statement gives a concrete commitment to take action to improve in the short term, and states that they will do more learning and improvement, which is impressive given how many details they had already determined and communicated. In contrast, the government's statements suggest that no learning has yet taken place and that improvements might not happen.
Yet from examining the series of news releases, it is apparent to those with an I.T. background that there are many opportunities for improvement. I have identified the following just from analyzing the information publicly available from the government's news releases:
- Services using mirrored data were restored more quickly than those having to recover from tape backups (see the July 13 and July 15 news releases). Why weren't all services using mirroring, in particular services like land titles, one of the last services restored, whose disruption had a much more significant impact on Albertans than that of some of the other services?
- Why did the land titles system and motor vehicle registry take two extra days to restore (see the July 15 and July 17 news releases)? These were fairly critical services compared to others restored sooner, such as fishing and hunting license sales, which are clearly lower priority. This suggests that something went wrong in the attempt to restore these services.
- Even for services using mirrored data, it took at least one and a half days to report that these services were back up. (I cannot be more precise as to the time interval because the government news releases specify only dates and no times, unlike the Amazon report.) Perhaps some mirrored systems did fail over immediately, never suffered a service disruption, and thus were never reported on - I cannot tell from the news releases. But for the mirrored services that were disrupted, there must be improvements that can be made to the time required to fail over.
- The government's I.T. disaster recovery plan certainly can be improved since any real disaster such as this one will provide lessons learned above and beyond what regular disaster recovery testing will identify.
One potential critique of my use of Amazon to contrast with the Alberta government is that Amazon is a large, world-class organization specializing in providing I.T. services (as well as selling items online). Perhaps one cannot expect the Alberta government to have the same level of I.T. expertise. Fortunately for me, the government itself negated this critique by stating multiple times that they are using IBM as their service provider, and in particular making statements like "We will continue to work with our partner IBM, a world leader in information technology,..." and "A dedicated, broad team of IBM experts will continue to work non-stop..." (from the July 15 news release). So in my view, the government has no excuse for their poor communication.
As a citizen and taxpayer of Alberta, I conclude by calling on my government to take accountability and step up by providing proper problem communication detailing what has been learned regarding this outage and what improvements have been and will be made.