
Complexity and Reliability

Unrestrained complexity is a critical limiting factor in producing working software. The more complex a system, the more it will cost to create and operate and the less reliable it will be. Yet the bane of complexity is largely ignored by the IT industry. Software vendors, competing on the basis of feature sets, are constantly enhancing their existing products and introducing new, more capable ones. IT consultants trying to win more work are constantly pitching ideas for new systems, new business solutions, and new capabilities. Customers are constantly asking for new or enhanced functionality. Software developers thrive on creating this functionality. These forces all lead towards greater complexity. No one benefits from fighting complexity, so its harmful effects are not publicized.

Actually, my last sentence is not true. Customers do want software that works, and since simpler software is more reliable, they benefit from fighting complexity. Unfortunately, the costs of complexity are largely hidden from customers, so they seldom realize the cost of asking for more features. They just get upset when the software stops working or works poorly, without appreciating their own contribution to the problem. IT operational staff also benefit from fighting complexity, since they need to keep systems running as reliably as possible. But they seldom have much, if any, influence on the procurement or development of these systems.

Lately I have been struggling to improve the reliability of a particular system. As the team has identified and tried to resolve various issues, I have come to see that the high complexity of the system overshadows our efforts. Why does complexity so strongly affect reliability? I like using a mechanical analogy: the more moving parts in a device, the higher the probability that one of them will fail within a fixed time, thus lowering the overall reliability of the device. In an IT system, the failure points are different. The physical devices - the hardware - are ironically the simpler part to manage, since they are easy to improve through redundancy. It is the software that is the problem. The greater the complexity of the software, the higher the likelihood of defects - not just within the application code itself, but also in the overall software stack that it uses. For an enterprise business application, this typically includes third-party libraries, an application server, a web server, a database server, and an operating system, and can include additional services such as email, scheduling, or messaging. A defect anywhere in the stack can cause the application to fail.

The problem with software reliability goes beyond defects. In an enterprise setting, applications experience a wide variety of changes, each of which represents an opportunity for failure. Each of these changes is in essence a "moving part", even when the application code itself has not changed. The most typical change is an enhancement to the application, which can introduce new defects in both the new and the existing functionality. Other examples include upgrades to application servers, web servers, database servers, operating systems, or hardware; configuration changes such as email settings, network addresses, or schedules; and security changes such as password expiration. The more complex the system, the more of these changes it experiences, which increases the chance of failure.

The relationship between complexity and reliability can be modeled statistically. I will represent an IT system as a collection of pieces, where P is the number of pieces and F is the chance of each piece failing, expressed as a probability of failing within one year. I think of each piece as abstractly representing something that can fail - the equivalent of a moving part in a mechanical device. The number of pieces correlates with the complexity of the system. While it is hard to determine even approximate values for these measures in a real system, working with abstract figures still gives an appreciation of the relationship between the two. The probability of the system having no failures in one year is (1 - F)^P. Using baseline values of 100 pieces and a 0.01 (1%) probability of failure for each piece in the year, the chance of no failures in a year is only 37%. This means the chance of having one or more failures is 63%. What happens as the complexity increases?

# of Pieces (P)    % Chance of failure per piece (F)    Overall % chance of no failures
100                1%                                    37%
200                1%                                    13%
500                1%                                    0.7%
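
To make the arithmetic behind these figures concrete, here is a minimal Python sketch of the model. The function name and sample values are my own illustrations; the figures simply mirror the table above.

    def prob_no_failures(pieces, failure_prob):
        """Probability that none of the pieces fails within the year,
        assuming independent failures: (1 - F) ** P."""
        return (1 - failure_prob) ** pieces

    # Reproduce the table: hold F at 1% and increase the number of pieces.
    for pieces in (100, 200, 500):
        p = prob_no_failures(pieces, 0.01)
        print(f"{pieces} pieces at 1% failure each: {p:.1%} chance of no failures")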

The reliability of the system falls quickly as the number of pieces increases. To maintain the same overall reliability when the complexity doubles, the chance of failure for each piece must be cut roughly in half.

# of Pieces (P)    % Chance of failure per piece (F)    Overall % chance of no failures
100                1%                                    37%
200                0.5%                                  37%
500                0.2%                                  37%
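
To see how much each piece must improve as complexity grows, the same model can be inverted: for a target probability of no failures, the required per-piece failure probability is F = 1 - target^(1/P). The sketch below (again illustrative, reusing the 37% baseline from the 100-piece case) reproduces the second table.

    def required_failure_prob(pieces, target_no_failure):
        """Per-piece failure probability F needed so that
        (1 - F) ** pieces equals the target probability of no failures."""
        return 1 - target_no_failure ** (1 / pieces)

    # Hold the overall reliability at the 100-piece baseline of roughly 37%.
    target = 0.99 ** 100  # about 36.6%
    for pieces in (100, 200, 500):
        f = required_failure_prob(pieces, target)
        print(f"{pieces} pieces: each piece must fail with probability {f:.2%}")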

In practice, however, more complex systems are harder to understand and change, which reduces the reliability of each change that is made. Once a system does fail, greater complexity makes the problem harder to diagnose and fix, so downtime lasts longer. Complexity therefore leads not only to more failures but to more serious ones.

Complexity and reliability are closely connected. If you have no plan to manage the complexity of a system, then you may be unpleasantly surprised by what happens to its reliability. Since our goal as professionals is to provide software that works, thinking about complexity and reliability is a necessity.

