<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Basil Vandegriend: Professional Software Development &#187; reliability</title>
	<atom:link href="http://www.basilv.com/psd/blog/tag/reliability/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.basilv.com/psd</link>
	<description></description>
	<lastBuildDate>Wed, 25 Jan 2012 13:23:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Error Handling and Reliability</title>
		<link>http://www.basilv.com/psd/blog/2007/error-handling-and-reliability</link>
		<comments>http://www.basilv.com/psd/blog/2007/error-handling-and-reliability#comments</comments>
		<pubDate>Fri, 12 Jan 2007 15:00:44 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[design]]></category>
		<category><![CDATA[error handling]]></category>
		<category><![CDATA[reliability]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/blog/2007/error-handling-and-reliability</guid>
		<description><![CDATA[I have been thinking a lot lately about how to create reliable systems. I previously examined the link between complexity and reliability. Recently, however, I have come to appreciate the impact of error handling on reliability. For the purposes of this discussion, I consider two aspects of reliability: correctness - does the application produce the [...]]]></description>
			<content:encoded><![CDATA[<p>I have been thinking a lot lately about how to create reliable systems. I previously examined the link between <a href="http://www.basilv.com/psd/blog/2006/complexity-and-reliability">complexity and reliability</a>. Recently, however, I have come to appreciate the impact of error handling on reliability. For the purposes of this discussion, I consider two aspects of reliability: <em>correctness</em> - does the application produce the correct results, and <em>uptime</em> - the length of time the software can operate without terminating due to an error. A single defect or environmental problem can impact one or both of these measures. For example, a defect in an algorithm can cause a program to calculate the wrong result, without impacting uptime. A memory leak or network outage can impact uptime without impacting correctness. A null pointer exception impacts both. The error handling strategy you choose for your system affects both the correctness and the uptime. I am familiar with three main approaches to handling errors:</p>
<ul>
<li>Ignore errors</li>
<li>Fail fast</li>
<li>Degrade gracefully</li>
</ul>
<p>The <em>ignore errors</em> approach is very simple: assume errors will not happen and ignore them. Some of you may object that this is not a 'real' error handling strategy, but considering how often I have seen it used in production systems, I cannot agree. This approach does have the benefit of maximizing uptime: even if things go wrong, the program will keep running. Of course, if the program is producing incorrect output due to these errors, then you have a problem. So this approach tends to minimize correctness. Any problems that do occur are what I call <em>silent failures</em> that go undetected, at least for a while. Unix scripts and the C programming language adopt this strategy as the default: utilities and functions have return codes to report errors, so a call that results in an error will not affect the operation of your program or script unless you have an explicit check.</p>
<p>The <em>fail fast</em> approach is also very simple: whenever an error or unexpected event happens, immediately terminate execution. This approach tends to maximize correctness, but tends to minimize uptime, since any abnormality causes the program to end. These applications tend to be brittle. The slightest problem in the environment, such as a blip in the network, can bring down the application. Modern enterprise programming languages such as Java and C# adopt this strategy through the use of exceptions. If a problem occurs, an exception is thrown which will terminate the program unless explicitly caught and dealt with.</p>
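<p>The difference between the ignore errors and fail fast defaults can be illustrated with a small sketch. It is written in Python rather than in a shell script, C, or Java, purely for brevity. The function names and the failing child command are made up for this illustration; <code>subprocess.run</code> and its <code>check</code> parameter are part of the Python standard library.</p>

```python
import subprocess
import sys

# Hypothetical failing step: a child process that always exits with status 3.
FAILING_COMMAND = [sys.executable, "-c", "import sys; sys.exit(3)"]


def run_ignoring_errors():
    """Ignore errors: the exit status is reported, but nothing forces a check."""
    result = subprocess.run(FAILING_COMMAND)
    # The caller is free to never look at this value, so the failure is silent.
    return result.returncode


def run_fail_fast():
    """Fail fast: check=True turns a non-zero exit status into an exception."""
    subprocess.run(FAILING_COMMAND, check=True)
```

<p>Calling <code>run_ignoring_errors()</code> returns normally even though the child process failed, while <code>run_fail_fast()</code> raises <code>subprocess.CalledProcessError</code>, which ends the program unless something catches it.</p>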
<p>The <em>degrade gracefully</em> approach combines the best of the other two approaches. It detects errors like the fail fast approach, but instead of failing immediately, it handles the error and continues execution as appropriate. It therefore maximizes the reliability of the system by maximizing both correctness and uptime. The downside of this approach is that it requires much more thought and effort to implement. No programming language I am aware of provides explicit support for this approach.</p>
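<p>As a rough illustration of what degrading gracefully can look like in code, here is a minimal Python sketch. Everything in it is an assumption made for the example: the exchange rate scenario, the <code>load_exchange_rate</code> function, and the idea of falling back to a cached value are not taken from any particular system.</p>

```python
import logging

logger = logging.getLogger("degrade_gracefully_sketch")


def load_exchange_rate(fetch_live, cached_rate):
    """Return (rate, degraded).

    fetch_live is a hypothetical callable that contacts a remote service;
    cached_rate is the last known good value.
    """
    try:
        return fetch_live(), False
    except (ConnectionError, TimeoutError) as exc:
        # The error is detected and recorded, unlike the ignore errors approach.
        logger.warning("live rate unavailable, using cached value: %s", exc)
        # Execution continues with a flagged fallback instead of terminating,
        # unlike the fail fast approach.
        return cached_rate, True
```

<p>The <code>degraded</code> flag lets the caller decide what reduced service is still appropriate, for example showing the value with a warning. Only the network-related errors are handled here; any other exception still propagates, so unexpected defects are not hidden.</p>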
<p>I was originally a strong proponent of the fail fast approach, but last year I started to appreciate the degrade gracefully approach, as I wrote in my article <a href="http://www.basilv.com/psd/blog/2006/fail-fast-or-degrade-gracefully">Fail Fast or Degrade Gracefully?</a>. Over the past year, my viewpoint has shifted further. I now feel that the degrade gracefully approach should be used by default. Only if it would require too much effort or complexity to implement should the fail fast approach be used instead. (Naturally I do not support the use of the ignore errors approach.)</p>
<p>There are many examples of the degrade gracefully approach within the IT infrastructure we rely on. TCP/IP networking stacks are designed to degrade gracefully when problems such as dropped packets occur. Web servers do not shut down if a web application experiences a failure - they instead terminate the current request by sending an error response to the client and continue to serve other requests. Email clients do not fail if the email server becomes unavailable, and more importantly the mail you were trying to send is not lost. Modern compilers do not stop upon encountering the first syntax error but instead continue parsing the same file (and other files) as best they can.</p>
<p>The validity of these examples could be debated. One could argue that some of these situations such as dropped network packets and bad user input (syntax errors in code) are expected - a normal part of operation - rather than representing an exceptional situation or error. The systems handle these situations because it is a requirement, not because they are using the degrade gracefully error handling approach. I would instead argue that the requirement is to use the degrade gracefully approach to handle these problematic situations, primarily because both the ignore errors approach and the fail fast approach are unacceptable.</p>
<p>Reliable systems do not happen by accident, but require careful thought and effort to create. The approach you choose for handling errors can have a bigger impact on reliability than you might expect.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2007/error-handling-and-reliability/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lessons Learned in 2006</title>
		<link>http://www.basilv.com/psd/blog/2006/lessons-learned-in-2006</link>
		<comments>http://www.basilv.com/psd/blog/2006/lessons-learned-in-2006#comments</comments>
		<pubDate>Mon, 18 Dec 2006 20:43:35 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[learning]]></category>
		<category><![CDATA[professional]]></category>
		<category><![CDATA[deploy]]></category>
		<category><![CDATA[error handling]]></category>
		<category><![CDATA[personal development]]></category>
		<category><![CDATA[reliability]]></category>
		<category><![CDATA[software releases]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/blog/2006/lessons-learned-in-2006</guid>
		<description><![CDATA[As a proponent of perpetual learning, I like to periodically take the time to reflect on what I have learned. Looking back at this past year, I definitely expanded my understanding in a number of areas based on my experiences at work and at home. My most significant growth was in the area of personal [...]]]></description>
			<content:encoded><![CDATA[<p>As a proponent of <a href="http://www.basilv.com/psd/blog/2006/perpetual-learning">perpetual learning</a>, I like to periodically take the time to reflect on what I have learned. Looking back at this past year, I definitely expanded my understanding in a number of areas based on my experiences at work and at home.</p>
<p>My most significant growth was in the area of personal productivity: I read and implemented the organizational system described in the book <a href="http://www.amazon.ca/exec/obidos/redirect?link_code=as2&#038;path=ASIN/0743520343&#038;tag=basilvandegri-20&#038;camp=15121&#038;creative=330641">Getting Things Done : The Art Of Stress-Free Productivity</a><img src="http://www.assoc-amazon.ca/e/ir?t=basilvandegri-20&#038;l=as2&#038;o=15&#038;a=0743520343" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> by David Allen. I have found this system very useful both at work and at home, and wrote an <a href="http://www.basilv.com/psd/blog/2006/getting-things-done">article describing my experience implementing the system</a>.</p>
<p>This last year I switched to a <a href="http://www.basilv.com/psd/blog/2006/working-four-days-a-week">four day work week</a>, and have found it to be a significant improvement over the normal five days. I even figured out how to do this without a drop in pay.</p>
<p>One area of software development I have explored a lot this year has been change and release management: specifically the process of promoting application or infrastructure changes into a production environment. Both my experiences at work and <a href="http://www.basilv.com/psd/blog/2006/running-my-website-3-month-retrospective">my experiences running this website</a> have made me appreciate the necessity of having a defined process. The amount of process that is necessary depends on the types of changes being made and the complexity of the environment. As I wrote in my article on <a href="http://www.basilv.com/psd/blog/2006/deploying-application-changes">deploying application changes</a>, it is easier when the application is packaged and deployed as a single unit. When this is not possible - as with database changes - more process is necessary. Adding too much process, however, can be just as harmful as having too little - a proper balance must be maintained.</p>
<p>Over this last year I have spent a lot of time thinking about application reliability. Ensuring a system is highly reliable is a surprisingly difficult task, and one of the major culprits is <a href="http://www.basilv.com/psd/blog/2006/complexity-and-reliability">complexity</a>. I have found myself arguing more strongly for simpler, more reliable solutions as a result. My most recent focus has been on error handling.  My article on <a href="http://www.basilv.com/psd/blog/2006/fail-fast-or-degrade-gracefully">the fail fast and degrade gracefully approaches to error handling</a> contains some earlier thoughts on this subject, but I have not yet written up my latest ideas. I have come to believe that the degrade gracefully approach, while more difficult to implement, is the best for creating highly reliable software. I have unfortunately encountered a third approach to error handling: ignore errors. This leads to silent, undetected failures whose negative effects go unnoticed for an arbitrary length of time.</p>
<p>I have also learned a <a href="http://www.basilv.com/psd/blog/2006/running-my-website-3-month-retrospective">number of lessons from running this website</a>, including improving my knowledge of web design, promotion, and on-line advertising. I have found the process of writing my weekly articles to be a great learning aid. Writing forces me to reflect on and clarify my thoughts and ideas on a particular subject. It is no coincidence that I have provided a number of links above to articles I have written this past year. I tend to write about topics that are currently in my focus of attention, and the act of writing about them helps clarify and solidify the learning I have done. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2006/lessons-learned-in-2006/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Complexity and Reliability</title>
		<link>http://www.basilv.com/psd/blog/2006/complexity-and-reliability</link>
		<comments>http://www.basilv.com/psd/blog/2006/complexity-and-reliability#comments</comments>
		<pubDate>Thu, 14 Sep 2006 15:00:32 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[design]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[complexity]]></category>
		<category><![CDATA[reliability]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/blog/2006/complexity-and-reliability</guid>
		<description><![CDATA[Unrestrained complexity is a critical limiting factor in producing working software. The more complex a system, the more it will cost to create and operate and the less reliable it will be. Yet the bane of complexity is largely ignored by the IT industry. Software vendors, competing on the basis of feature sets, are constantly [...]]]></description>
			<content:encoded><![CDATA[<p>Unrestrained complexity is a critical limiting factor in producing working software. The more complex a system, the more it will cost to create and operate and the less reliable it will be. Yet the bane of complexity is largely ignored by the IT industry. Software vendors, competing on the basis of feature sets, are constantly enhancing their existing products and introducing new, more capable ones. IT consultants trying to win more work are constantly pitching ideas for new systems, new business solutions, and new capabilities. Customers are constantly asking for new or enhanced functionality. Software developers thrive on creating this functionality. These forces all lead towards greater complexity. No one benefits from fighting complexity, so its harmful effects are not publicized.</p>
<p>Actually, my last sentence is not true. Customers do want software that works, and since simpler software is more reliable, they benefit from fighting complexity. Unfortunately, the costs of complexity are largely hidden from the sight of the customer, so they seldom realize the cost involved in asking for more features. They just get upset when the software stops working or works poorly, and they do not appreciate their contribution to the problem. IT operational staff also benefit from fighting complexity to keep systems as reliable as possible, since they need to keep them running. But they seldom have much if any influence upon the procurement or development of these systems.</p>
<p>Lately I have been struggling with improving the reliability of a particular system. As the team has identified and tried to resolve various issues, I have come to see that the high complexity of the system overshadows our efforts. Why does complexity so strongly affect reliability? I like using a mechanical analogy: the more moving parts in a device, the higher the probability that one of them will fail within a fixed time, thus lowering the overall reliability of the device. In an IT system, the failure points are different. The actual physical devices - the hardware - are ironically simpler to manage, since their reliability is easy to improve through redundancy. It is the software that is the problem. The greater the complexity of the software, the higher the likelihood of defects - not just within the application code itself, but also in the overall software stack that is used. For an enterprise business application, this typically includes third party libraries, application server, web server, database server, and operating system, and can include additional services such as email, scheduling or messaging. A defect anywhere in the stack can cause the application to fail.</p>
<p>The problem with software reliability goes beyond just defects. In an enterprise setting, applications experience a wide variety of changes, each of which represents an opportunity for failure. Each of these changes is in essence a "moving part", even when the actual code for the application has not changed. The most typical changes are enhancements to the application, which can introduce new defects in both the new and existing functionality. Other examples include upgrades to application servers, web servers, database servers, operating systems, or hardware; configuration changes to systems such as email, network addresses, or scheduling; and security changes such as password expiration. The more complex the system, the more of these changes it experiences, which increases the chance of failure.</p>
<p>The relationship between complexity and reliability can be modeled statistically. I will represent an IT system as a collection of P pieces, each of which has a chance of failure F, expressed as a probability of failing within one year. I think of each piece as abstractly representing something that can fail - the equivalent of that moving part in a mechanical device. The number of pieces correlates with the complexity of the system. While it is hard to determine even approximate values for these measures in a real system, just using abstract concepts and figures can provide an appreciation for the relationship between the two values. Assuming the pieces fail independently of each other, the probability of the system having no failures in one year is (1-F)<sup>P</sup>. Using baseline values of 100 pieces and a 0.01 probability of failure for each piece in the year (1%), the chance of no failures in a year is only 37%. This means the chance of having one or more failures is 63%. What happens as the complexity increases?</p>
<table class="fancy" cellspacing="0">
<tr>
<th># of Pieces (P)</th>
<th> % Chance of failure per piece (F)</th>
<th>Overall % chance of no failures</th>
</tr>
<tr>
<td>100</td>
<td>1%</td>
<td>37%</td>
</tr>
<tr>
<td>200</td>
<td>1%</td>
<td>13%</td>
</tr>
<tr>
<td>500</td>
<td>1%</td>
<td>0.7%</td>
</tr>
</table>
<p>The reliability of the system falls quickly as the number of pieces is increased. In order to maintain the same reliability when the complexity doubles, the chance of failure of each piece must be cut roughly in half.</p>
<table class="fancy" cellspacing="0">
<tr>
<th># of Pieces (P)</th>
<th> % Chance of failure per piece (F)</th>
<th>Overall % chance of no failures</th>
</tr>
<tr>
<td>100</td>
<td>1%</td>
<td>37%</td>
</tr>
<tr>
<td>200</td>
<td>0.5%</td>
<td>37%</td>
</tr>
<tr>
<td>500</td>
<td>0.2%</td>
<td>37%</td>
</tr>
</table>
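<p>The figures in both tables above can be reproduced with a few lines of code. The following Python sketch evaluates (1-F)<sup>P</sup> for each row; the function name is my own choice for this illustration, and the calculation assumes that pieces fail independently of each other.</p>

```python
def chance_of_no_failures(pieces, failure_chance):
    """Probability that none of `pieces` independent pieces fails within a year."""
    return (1.0 - failure_chance) ** pieces


# (P, F) pairs taken from the rows of the two tables above.
rows = [(100, 0.01), (200, 0.01), (500, 0.01), (200, 0.005), (500, 0.002)]

for pieces, failure_chance in rows:
    survival = chance_of_no_failures(pieces, failure_chance)
    print(f"P={pieces:3d}  F={failure_chance:.1%}  no failures: {survival:.1%}")
```

<p>The printed percentages carry one more decimal place than the tables, which round to the nearest whole percent (except for the 0.7% entry).</p>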
<p>In practice, however, more complex systems are harder to understand and change, thus reducing the reliability of each change that is made. Once a system does fail, greater complexity means that it is often harder to diagnose and fix the problem. This makes the downtime longer. Complexity therefore also leads to more serious failures.</p>
<p>Complexity and reliability are closely connected. If you have no plan to manage the complexity of a system, then you may be unpleasantly surprised by what happens to its reliability. Since our goal as professionals is to provide software that works, thinking about complexity and reliability is a necessity.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2006/complexity-and-reliability/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

