<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Basil Vandegriend: Professional Software Development &#187; defects</title>
	<atom:link href="http://www.basilv.com/psd/blog/tag/defects/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.basilv.com/psd</link>
	<description></description>
	<lastBuildDate>Wed, 25 Jan 2012 13:23:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Defects &#8211; To Fix or Not to Fix</title>
		<link>http://www.basilv.com/psd/blog/2011/defects-to-fix-or-not-to-fix</link>
		<comments>http://www.basilv.com/psd/blog/2011/defects-to-fix-or-not-to-fix#comments</comments>
		<pubDate>Tue, 04 Oct 2011 13:41:11 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[quality]]></category>
		<category><![CDATA[agile]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[lean]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/?p=703</guid>
		<description><![CDATA[To fix defects or not fix defects, that is the question: whether it is better to suffer the complaints of outraged users, or to divert effort to investigate and eliminate them. Shakespeare quotes aside, every software development project has to make decisions on how many defects to fix and which ones to leave alone prior [...]]]></description>
			<content:encoded><![CDATA[<p>To fix defects or not fix defects, that is the question: whether it is better to suffer the complaints of outraged users, or to divert effort to investigate and eliminate them. </p>
<p>Shakespeare quotes aside, every software development project has to make decisions on how many defects to fix and which ones to leave alone prior to shipping. While I have seldom seen this question debated within projects, the advice from industry thought leaders varies considerably. The Agile and Lean methods of software development in particular have somewhat opposing perspectives. </p>
<p>I believe that considering both sides of this question provides a fuller understanding of the issues and better equips us to answer appropriately. Therefore in the two sections below I explore the reasons behind both sides of the debate. </p>
<h3>To Fix</h3>
<ul>
<li>Shipping poor quality, defect-ridden code can upset users, turn away customers, and lead to a hard-to-shake bad reputation.</li>
<li>The decision that a feature is worth developing is made with the expectation that it will work correctly. So any defects found in a feature mean that the feature is still incomplete until these issues are fixed.</li>
<li>Defects provide feedback regarding the development process. Each defect represents an opportunity to do a root cause analysis of what led to the defect and put countermeasures in place to prevent re-occurrence. The Lean mindset of "Stop the line" demands that new development be put on hold to fix newly discovered defects.</li>
<li>Defects introduce the risk of compounding quality problems. The impact of a defect can be more significant than initially realized. Defects can be inadvertently replicated in other parts of the system. Enhancing components with too many defects can slow progress to a halt, as the system becomes essentially a shifting quicksand that is too unstable to work on. Constantly fixing defects helps maintain a high velocity of development over time.</li>
<li>To mitigate risks in not fixing defects, each defect needs to be analyzed to understand its impact, cause, and required changes to fix. But after performing this analysis most of the work is usually done - the fix is relatively straightforward. Waiting to decide later to fix the defect (e.g. in a subsequent release) causes all the knowledge gained in the analysis to decay over time which is wasteful (in the Lean sense).</li>
</ul>
<h3>Not To Fix</h3>
<ul>
<li>Significantly delaying the release of software to fix all defects leads to a loss of immediate revenue and potentially loss of market share due to competitors beating you to market. So you cannot afford to wait to fix all defects.</li>
<li>The entrepreneurial mindset, especially for startups, is to ship early to get feedback from paying customers. Perfection is the enemy of the good.</li>
<li>Under at least some versions of Scrum, defects are considered new tasks that are added to the product backlog to be prioritized by the product owner. This prioritization is based on the defect's impact (severity and likelihood of occurrence) and the effort required to fix it. Many minor defects will therefore likely never be fixed as new functionality will typically be of higher value.</li>
<li>Stopping to analyze and fix defects disrupts developers who are in the middle of working on other functionality and is wasteful.</li>
<li>Fixing defects in functionality that is already otherwise finished development and testing will require additional regression testing. Not fixing now and waiting until enhancements to this functionality are needed minimizes the extra effort required.</li>
</ul>
<h3>Conclusion</h3>
<p>Shakespeare was wrong. There is actually a third perspective regarding whether or not to fix defects: avoid the question as much as possible by focusing on defect prevention. The Lean mindset of building quality in avoids all the waste associated with finding, analyzing, and fixing defects and should be our preferred approach. Only when it fails and the occasional defect is introduced do we then have to answer the question.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2011/defects-to-fix-or-not-to-fix/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How Should You Feel About Defects</title>
		<link>http://www.basilv.com/psd/blog/2011/how-should-you-feel-about-defects</link>
		<comments>http://www.basilv.com/psd/blog/2011/how-should-you-feel-about-defects#comments</comments>
		<pubDate>Tue, 12 Apr 2011 16:19:39 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[professional]]></category>
		<category><![CDATA[corporate culture]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[quality]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/?p=627</guid>
		<description><![CDATA[I have recently observed myself and others having a variety of reactions when defects are found ranging between the extremes of elation and despair. How should we feel when defects are discovered? Should this vary by role? Role-Based Attitudes I will first answer this question on a role by role basis, starting with the role [...]]]></description>
			<content:encoded><![CDATA[<p>I have recently observed myself and others having a variety of reactions when defects are found ranging between the extremes of elation and despair. How should we feel when defects are discovered? Should this vary by role? </p>
<h3>Role-Based Attitudes</h3>
<p>I will first answer this question on a role by role basis, starting with the role of tester. Since one of the primary objectives and professional skills of testers is to find defects, I would expect them to generally be happy when they find problems. The more significant (e.g. critical) the defect or more tricky to find, the happier I would expect them to be. If the defects are trivially easy to find, or are so serious that they prevent further testing, then I could see testers getting upset because they are prevented from exercising their craft to the fullest level of their capabilities. </p>
<p>What about developers? When developers find defects in <em>other</em> people's code, perhaps via a code review, this is fairly similar to the tester role above so I would expect developers should generally be happy in this scenario. </p>
<p>When a developer finds a defect in their own code as part of the development process - e.g. by having an automated test that unexpectedly fails after a code change - how should they feel? It is only natural to feel some disappointment at having made a mistake, but I do not believe this is the best reaction. In fact, it may be harmful. If you experience mild distress or pain whenever you find your own defects, you may subconsciously start to try to avoid the pain by, for example, doing less testing. So I believe a better reaction is to be happy when you find your own defects. This may seem challenging to do, but consider these reasons:</p>
<ul>
<li>Finding the defect validates your skill at performing whatever defect detection activity you were doing - whether it was automated testing or personal code review - and reinforces the value of performing that activity.</li>
<li>Discovering this defect provides a learning opportunity on how you might better prevent or detect this type of defect in the future. Learning opportunities are good.</li>
<li>As a professional, you want to pass along high quality code. Every defect you find in your own code is one less defect others will find, thus raising the quality of your final output.
</li>
</ul>
<p>The final scenario to consider is when someone else finds defects in the developer's code. A typical example is when the developer considers a feature finished and a tester is testing it. Now this must be bad, right? Even more than the prior scenario it seems natural to feel disappointment. Each defect that is discovered is essentially downgrading the quality of what you have produced. However, some of the reasons we listed previously for feeling positive still apply. In particular if the defect was still found within the team rather than being passed along to the end user or client, then this is a good thing. It can be hard to see this as a developer when it is your code at fault. </p>
<p>So let us consider the final role: that of the team lead or manager who is supervising activity rather than specifically doing coding or testing. This role has no personal stake in individual defects, but instead is typically concerned about the broader implications to the project or product. Assuming that high quality is a key objective, how should such individuals feel about defects? One tendency is to react negatively because of the extra effort and schedule time required to fix defects. Finding too many defects can also lead to alarm bells over the level of quality. I feel that since the team lead or manager essentially represents the whole team, their reaction should correspond to the perspective of the entire team.</p>
<h3>Team-Based Attitudes</h3>
<p>So what should the attitude of the team as a whole be to discovering defects? To properly answer this question we need to know what the team's objective is. I am going to assume it is to <a href="http://www.basilv.com/psd/blog/2008/our-mission-as-software-developers">build working software that is being used and meeting users' needs</a>. The <em>existence</em> of defects is clearly bad since they threaten this objective. But what about the <em>discovery</em> of defects? Discovering a defect means the team gains more information about the state of the software which it will typically use to improve the software by fixing the defect, and which the team can also use as feedback to improve. </p>
<p>Discovered defects exist prior to being found, but the team does not know about them until the time of discovery. This means that the team should experience simultaneous conflicting reactions: happy that the defect was discovered, yet unhappy that the defect exists.</p>
<h3>Individual Attitudes</h3>
<p>So what should the attitude of the various individuals on a team be towards defects? Should it be based on their role? I believe that in an ideal team every individual will understand and adopt the common objectives of the team. This means that everyone's attitude towards defects should ideally be the same, no matter what the role. </p>
<p>As I have observed, in practice attitudes tend to be quite far from this ideal state. People maintain their own personal objectives and viewpoint in addition to or instead of the team's. Often they do not appreciate the distinction between discovering defects and the existence of defects. To correct this I believe team members need to be clear on the team's objectives and understand how the existence and discovery of defects impacts these objectives. I hope this article will help in that regard.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2011/how-should-you-feel-about-defects/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Predicting and Evaluating Defect Levels</title>
		<link>http://www.basilv.com/psd/blog/2011/predicting-and-evaluating-defect-levels</link>
		<comments>http://www.basilv.com/psd/blog/2011/predicting-and-evaluating-defect-levels#comments</comments>
		<pubDate>Tue, 18 Jan 2011 14:00:15 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[quality]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[estimate]]></category>
		<category><![CDATA[metrics]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/?p=592</guid>
		<description><![CDATA[Is it possible to predict how many defects will be encountered in acceptance test or production? What number of defects would be considered reasonable versus signs of low or high quality? These are questions I considered when my last project entered acceptance test. At the time I had no good answers. So over the past [...]]]></description>
			<content:encoded><![CDATA[<p>Is it possible to predict how many defects will be encountered in acceptance test or production? What number of defects would be considered reasonable versus signs of low or high quality? These are questions I considered when my last project entered acceptance test. At the time I had no good answers. So over the past months I have been searching for information on defect levels and quality metrics that could help answer these questions. The most useful source I have found is <a href="http://en.wikipedia.org/wiki/Capers_Jones">Capers Jones</a>, a researcher and consultant on formal software estimation.<br />
Jones has written a number of articles and books in which he provides metrics on defect levels based on benchmarks derived from literally thousands of projects. The ones I found most useful were:</p>
<ul>
<li><a href="http://www.rbcs-us.com/images/documents/Measuring-Defect-Potentials-and-Defect-Removal-Efficiency.pdf">Measuring Defect Potentials and Defect Removal</a> (pdf)</li>
<li><a href="http://www.amazon.ca/gp/product/0201485427?ie=UTF8&#038;tag=basilvandegri-20&#038;linkCode=as2&#038;camp=15121&#038;creative=330641&#038;creativeASIN=0201485427">Software Assessments, Benchmarks, and Best Practices</a><img src="http://www.assoc-amazon.ca/e/ir?t=basilvandegri-20&#038;l=as2&#038;o=15&#038;a=0201485427" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />
</li>
<li><a href="http://www.amazon.ca/gp/product/0071483004?ie=UTF8&#038;tag=basilvandegri-20&#038;linkCode=as2&#038;camp=15121&#038;creative=330641&#038;creativeASIN=0071483004">Estimating Software Costs: Bringing Realism to Estimating</a><img src="http://www.assoc-amazon.ca/e/ir?t=basilvandegri-20&#038;l=as2&#038;o=15&#038;a=0071483004" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />
</li>
</ul>
<p>The information I provide in the remainder of this article comes primarily from these references.</p>
<h3>Estimating Software Size</h3>
<p>The first step in Jones's approach is to estimate the size of the software. His preferred size metric is <a href="http://en.wikipedia.org/wiki/Function_point">function points</a>, which are a language/technology neutral evaluation of the business functionality based primarily on an assessment of program inputs, program outputs, and data storage. Another common metric is source lines of code, often abbreviated as SLOC and counted here as logical lines. KLOC represents 1000 lines of code. </p>
<p>Lines of code are a convenient metric in that they can be measured automatically with a tool, whereas function points require a trained function point counter. Nevertheless, function points are superior in several key ways:</p>
<ul>
<li>Roughly fifty percent of defects are due to problems with requirements or design rather than coding, for which metrics in terms of lines of code do not make much sense. In particular, different implementations of the same functionality can have significant variances in the lines of code required by up to a factor of four. A longer implementation will have fewer requirement or design defects per KLOC and thus artificially seem to have higher quality, when in reality the code is simply bloated. (This is related to the issue of measuring developer productivity by lines of code produced.)</li>
<li>The use of multiple languages complicates source code counts, and this is far more common than one may expect. Even a simple web application typically includes JavaScript, HTML, CSS, and perhaps SQL in addition to the primary language (e.g. Java). Function points are language-independent.</li>
<li>Function points can be determined once the design is known, whereas lines of code cannot be counted until coding is complete. This allows function points to be useful for planning and estimation purposes much earlier.</li>
</ul>
<p>Since Jones prefers function points, the defect metrics he provides in his writings are typically expressed using this measure. While I understood the reasons, I found these metrics difficult to apply because I had no idea of the function point counts of the software I was working on. So I was pleased to discover that benchmarks such as <a href="http://www.qsm.com/?q=resources/function-point-languages-table/index.html">this one</a> exist to convert between function points and lines of code for various languages. One function point on average is roughly 50 logical lines of Java code. I use this conversion factor in the sections below.</p>
<h3>Defect Potential</h3>
<p>All software has the potential of having defects. Defect potential is a measurement of the expected number of defects in a particular piece of software. This is also called the injection rate - the number of defects being introduced throughout development. The primary factor determining the number of defects is the size of the application in function points. The maturity - experience, skill, and attention to quality - of the development team is another key factor in determining defect potential. The following table specifies how to calculate the expected number of defects given these two factors.</p>
<table class="fancy" cellspacing="0">
<tr>
<th>Maturity Level</th>
<th>Defects / Function Point</th>
<th>Lines of Code / Defect</th>
</tr>
<tr>
<td>Worst Organizations</td>
<td>9</td>
<td>6</td>
</tr>
<tr>
<td>Average Organizations</td>
<td>5</td>
<td>10</td>
</tr>
<tr>
<td>Best Organizations</td>
<td>2</td>
<td>25</td>
</tr>
</table>
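<p>As a rough sketch, the table above can be turned into a small calculation. The snippet is in Python for brevity, assumes the conversion factor of 50 logical lines of Java per function point used in this article, and uses illustrative names that do not come from the referenced sources.</p>

```python
# Figures are taken from the table above; all names are illustrative.
LOC_PER_FUNCTION_POINT = 50  # rough average for Java, as assumed in this article

DEFECTS_PER_FUNCTION_POINT = {
    "worst": 9.0,
    "average": 5.0,
    "best": 2.0,
}


def expected_defects(lines_of_code, maturity="average"):
    """Estimate the defects injected during development of a code base."""
    function_points = lines_of_code / LOC_PER_FUNCTION_POINT
    return function_points * DEFECTS_PER_FUNCTION_POINT[maturity]


for level in ("worst", "average", "best"):
    print(level, round(expected_defects(25_000, level)))
```

<p>For a 25 KLOC application (about 500 function points) this gives roughly 4500, 2500, and 1000 expected defects for the worst, average, and best organizations respectively.</p>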
<p>Some key defect prevention activities contributing to reduced defect potential are:</p>
<ul>
<li>Close customer collaboration during requirements / design (e.g. JAD sessions)</li>
<li>Prototyping</li>
<li>Feedback / learning from design and code reviews</li>
</ul>
<p>The above metrics assume a linear relationship between defects and size, but as a system gets larger there are typically more interactions between pieces and more complexity, and thus a greater likelihood of defects than a linear increase would suggest. For very large systems, a more accurate metric is as follows: number of defects = function points raised to an exponent. For average organizations, use 1.25 as the exponent. (Good organizations can lower this to 1.15, while poor-performing organizations have this elevated to 1.35.)</p>
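<p>The power-law estimate can be sketched in the same way. This is a minimal Python illustration using the exponents quoted above; the function and dictionary names are assumptions made for the example.</p>

```python
# Exponents quoted in the paragraph above; names are illustrative.
EXPONENTS = {"best": 1.15, "average": 1.25, "worst": 1.35}


def defect_potential_large_system(function_points, maturity="average"):
    """Estimate total defects as function points raised to an exponent."""
    return function_points ** EXPONENTS[maturity]


def defect_potential_linear(function_points, defects_per_fp=5.0):
    """Linear estimate for comparison (5 defects per function point on average)."""
    return function_points * defects_per_fp


size = 1000
print(round(defect_potential_linear(size)))        # linear estimate
print(round(defect_potential_large_system(size)))  # power-law estimate
```

<p>For an average organization and 1000 function points the linear estimate is 5000 defects while the power-law estimate is about 5600. The two estimates for average organizations coincide at 625 function points, so the power-law form only predicts more defects than the linear one above that size.</p>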
<p>Defects can be categorized by origin - the type of activity that produced the defect. The table below shows this breakdown for average organizations.</p>
<table class="fancy" cellspacing="0">
<tr>
<th>Defect Origin</th>
<th>Defects / Function Point</th>
<th>Lines of Code / Defect</th>
<th>Percentage of Total</th>
</tr>
<tr>
<td>Requirements</td>
<td>1</td>
<td>50</td>
<td>20%</td>
</tr>
<tr>
<td>Design</td>
<td>1.25</td>
<td>40</td>
<td>25%</td>
</tr>
<tr>
<td>Coding</td>
<td>1.75</td>
<td>29</td>
<td>35%</td>
</tr>
<tr>
<td>Document</td>
<td>0.6</td>
<td>83</td>
<td>12%</td>
</tr>
<tr>
<td>Bad Fixes</td>
<td>0.4</td>
<td>125</td>
<td>8%</td>
</tr>
</table>
<h3>Defect Removal</h3>
<p>Defect removal is the identification and elimination of defects after they are introduced. The cumulative defect removal rate or defect removal efficiency of a development project is calculated as the number of defects eliminated prior to the release to production divided by the total number of defects found, where the total includes both the defects eliminated before release and those reported during the first 90 days of production use.</p>
<p>The following table shows how the defect removal rate varies with the maturity level of the team, just like defect potential, and shows the expected number of post-release defects based on the defect potential and removal metrics.</p>
<table class="fancy" cellspacing="0">
<tr>
<th>Maturity Level</th>
<th>Defect Removal Rate</th>
<th>Post-Release Defects / Function Point</th>
<th>Lines of Code / Post-Release Defect</th>
</tr>
<tr>
<td>Worst Organizations</td>
<td>60%</td>
<td>3.6</td>
<td>14</td>
</tr>
<tr>
<td>Average Organizations</td>
<td>85%</td>
<td>0.75</td>
<td>67</td>
</tr>
<tr>
<td>Best Organizations</td>
<td>95%</td>
<td>0.1</td>
<td>500</td>
</tr>
</table>
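<p>The post-release columns follow from the defect potential and the removal rate. The Python sketch below recomputes them under the assumption of 50 lines of code per function point; the names are illustrative.</p>

```python
# (defects per function point, defect removal rate) per maturity level,
# taken from the tables in this article. Names are illustrative.
LOC_PER_FUNCTION_POINT = 50

MATURITY = {
    "worst": (9.0, 0.60),
    "average": (5.0, 0.85),
    "best": (2.0, 0.95),
}


def post_release_defects_per_fp(defects_per_fp, removal_rate):
    """Defects that escape to production, per function point."""
    return defects_per_fp * (1 - removal_rate)


for level, (potential, removal) in MATURITY.items():
    escaped = post_release_defects_per_fp(potential, removal)
    # Print escaped defects per function point and lines of code per escaped defect.
    print(level, round(escaped, 2), round(LOC_PER_FUNCTION_POINT / escaped))
```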
<p>Removal efficiency varies for defects of different origins, as the following table shows using statistics for average organizations.</p>
<table class="fancy" cellspacing="0">
<tr>
<th>Defect Origin</th>
<th>Defect Removal Efficiency</th>
</tr>
<tr>
<td>Requirements</td>
<td>77%</td>
</tr>
<tr>
<td>Design</td>
<td>85%</td>
</tr>
<tr>
<td>Coding</td>
<td>95%</td>
</tr>
<tr>
<td>Document</td>
<td>80%</td>
</tr>
<tr>
<td>Bad Fixes</td>
<td>70%</td>
</tr>
</table>
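<p>Combining these removal efficiencies with the defect potential by origin given earlier for average organizations shows which kinds of defects are most likely to reach production. The following Python sketch does that arithmetic; the data structure and names are assumptions made for the example.</p>

```python
# origin: (defects per function point, removal efficiency) for average
# organizations, taken from the two by-origin tables in this article.
AVERAGE_ORG_BY_ORIGIN = {
    "Requirements": (1.00, 0.77),
    "Design": (1.25, 0.85),
    "Coding": (1.75, 0.95),
    "Document": (0.60, 0.80),
    "Bad Fixes": (0.40, 0.70),
}


def residual_defects_by_origin(table):
    """Defects per function point remaining after removal, for each origin."""
    return {
        origin: potential * (1 - efficiency)
        for origin, (potential, efficiency) in table.items()
    }


residual = residual_defects_by_origin(AVERAGE_ORG_BY_ORIGIN)
for origin, value in sorted(residual.items(), key=lambda item: -item[1]):
    print(origin, round(value, 4))
print("total", round(sum(residual.values()), 3))
```

<p>The total comes to about 0.745 defects per function point, in line with the 0.75 figure for average organizations. Under these numbers requirements defects make up the largest share of what escapes (0.23 per function point), even though coding is the largest source of defects introduced.</p>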
<p>Quality control procedures such as testing and reviews (inspections) vary in their effectiveness at removing defects as illustrated in the following table.</p>
<table class="fancy" cellspacing="0">
<tr>
<th>Quality Activity</th>
<th>Average Defect Removal Rate</th>
<th>Peak Defect Removal Rate</th>
</tr>
<tr>
<td>Requirements review</td>
<td>30%</td>
<td>50%</td>
</tr>
<tr>
<td>Design review</td>
<td>40%</td>
<td>65%</td>
</tr>
<tr>
<td>Personal review (design or code)</td>
<td>35%</td>
<td>60%</td>
</tr>
<tr>
<td>Code reviews or pair programming</td>
<td>50%</td>
<td>70%</td>
</tr>
<tr>
<td>Unit testing (automated or manual)</td>
<td>25%</td>
<td>50%</td>
</tr>
<tr>
<td>Functional testing</td>
<td>30%</td>
<td>45%</td>
</tr>
<tr>
<td>Regression testing</td>
<td>20%</td>
<td>30%</td>
</tr>
<tr>
<td>Performance testing</td>
<td>15%</td>
<td>25%</td>
</tr>
<tr>
<td>System testing</td>
<td>35%</td>
<td>50%</td>
</tr>
<tr>
<td>Acceptance testing</td>
<td>30%</td>
<td>45%</td>
</tr>
</table>
<p>Peak defect removal rates for a given activity are typically obtained only through the use of skilled, experienced staff who take a rigorous, disciplined approach to performing the activity. As an example consider unit testing. As it is normally performed it has a 25% defect removal rate. But an experienced developer following test-driven development will achieve nearly 100% code coverage and typically write better tests, leading to a higher 50% defect removal rate.</p>
<h3>Predicting Defect Levels</h3>
<p>The overall or cumulative defect removal rate for a development effort can be calculated by aggregating together the individual defect removal rates of the quality control procedures used by the team as follows: Cumulative removal rate = 1 - the product across all procedures of (1 - individual removal rate per procedure). </p>
<p>For example, if a team uses only unit testing (25% removal), functional testing (30%), and regression testing (20%), then the cumulative rate = 1 - (1-0.25) * (1-0.30) * (1-0.20) = 0.58 or 58%.</p>
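<p>The same formula is easy to script when comparing different mixes of quality activities. This is a minimal Python sketch (it needs Python 3.8 or later for <code>math.prod</code>); the function name is illustrative.</p>

```python
from math import prod


def cumulative_removal_rate(rates):
    """Combine per-activity removal rates: 1 - product of (1 - rate)."""
    return 1 - prod(1 - rate for rate in rates)


# Unit testing (25%), functional testing (30%), regression testing (20%).
testing_only = [0.25, 0.30, 0.20]
print(round(cumulative_removal_rate(testing_only), 2))

# The same team after adding code reviews at the average rate of 50%.
with_reviews = testing_only + [0.50]
print(round(cumulative_removal_rate(with_reviews), 2))
```

<p>The first call reproduces the 58% from the example above. Adding code reviews at their average 50% rate raises the cumulative rate to 79%. Note that the formula treats the activities as independent, which is a simplifying assumption.</p>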
<p>The overall number of defects in production (or UAT) can be calculated using the defect potential of the team to determine the expected number of defects introduced, and using the cumulative defect removal rate to determine the number of defects remaining. Approximately 25% of defects will be high severity.</p>
<p>For example, an average organization injecting 1 defect per 10 lines of code for a 25 KLOC application will end up introducing a total of 2500 defects. Given a cumulative defect removal rate of 95%, this means that 2375 defects will be found, leaving 125 defects remaining of which 31 can be expected to be high severity.</p>
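<p>The worked example above can be expressed as a short Python function. The function name, parameter names, and the rounding to whole defects are assumptions made for this sketch; the 25% high severity share is the approximate figure quoted above.</p>

```python
def predict_remaining_defects(lines_of_code, loc_per_defect, removal_rate,
                              high_severity_share=0.25):
    """Predict injected, removed, and remaining defects for a code base."""
    injected = lines_of_code / loc_per_defect
    removed = injected * removal_rate
    remaining = injected - removed
    return {
        "injected": round(injected),
        "removed": round(removed),
        "remaining": round(remaining),
        "high_severity": round(remaining * high_severity_share),
    }


# 25 KLOC, 1 defect per 10 lines of code, 95% cumulative removal rate.
print(predict_remaining_defects(25_000, 10, 0.95))
```

<p>With the inputs from the example this yields 2500 defects injected, 2375 removed, 125 remaining, and 31 of high severity.</p>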
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2011/predicting-and-evaluating-defect-levels/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Defect Prevention Practices</title>
		<link>http://www.basilv.com/psd/blog/2010/defect-prevention-practices</link>
		<comments>http://www.basilv.com/psd/blog/2010/defect-prevention-practices#comments</comments>
		<pubDate>Wed, 08 Sep 2010 13:44:40 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[productivity]]></category>
		<category><![CDATA[coding standards]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[lean]]></category>
		<category><![CDATA[quality]]></category>
		<category><![CDATA[requirements]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/?p=549</guid>
		<description><![CDATA[I have written numerous times about defect elimination practices such as code reviews, unit testing, and static code analysis tools. From the perspective of lean thinking, however, eliminating defects, no matter how soon after they are introduced, results in waste due to rework to fix the defects. The ideal as far as lean is concerned [...]]]></description>
			<content:encoded><![CDATA[<p>I have written numerous times about defect <em>elimination</em> practices such as <a href="http://www.basilv.com/psd/blog/2007/strategies-for-effective-code-reviews">code reviews</a>, <a href="http://www.basilv.com/psd/blog/category/unit-testing/">unit testing</a>, and <a href="http://www.basilv.com/psd/blog/2009/why-you-should-be-using-findbugs">static code analysis tools</a>. From the perspective of lean thinking, however, eliminating defects, no matter how soon after they are introduced, results in waste due to rework to fix the defects. The ideal as far as lean is concerned is to prevent defects from occurring in the first place. </p>
<p>You must be careful, however, that the cost of these defect prevention practices does not become excessive. That would introduce a different type of waste – non-value adding process. The waterfall method of software development is an example of this. One of the principles behind waterfall is that careful requirements analysis and design will minimize downstream defects during coding and testing. Put another way, it is a good idea to understand what you need to build before you start building it. The problems with waterfall arise from going to extremes in applying this principle. Requirements analysis is done up front for the entire project as a big batch based on the theory that it minimizes rework due to future change, but in reality the constant pace of requirement changes plus the learning that occurs throughout the project will result in increasing amounts of change the longer the time spent doing requirements. In contrast, Scrum and Kanban apply this principle using a balanced approach – project level requirements are done at a high level, and the more detailed analysis is done on individual user stories just prior to implementing them. (See for example the article <a href="http://agile2009.agilealliance.org/files/WHI0001%20ScrumCMMI%20from%20Good%20to%20Great%201_11.PDF">Scrum and CMMI – Going from Good to Great: are you ready-ready to be done-done?</a>.)</p>
<p>In order to effectively adopt a defect prevention practice two pieces of information are needed:</p>
<ol>
<li>Specific, actionable steps to apply the practice.</li>
<li>The expected benefit. What categories of defects does the practice intend to prevent? This helps determine when to apply the practice and helps to evaluate it after adoption to assess its effectiveness.</li>
</ol>
<p>If we consider the idea of careful requirements analysis and design mentioned above as a prevention practice, the benefits are fairly clear - prevent requirement and design based errors - but specific actionable steps are missing so it does not qualify. (In fact, this is one of the contributing factors why waterfall projects can end up in the <a href="http://en.wikipedia.org/wiki/Analysis_paralysis">analysis paralysis</a> anti-pattern.)</p>
<p>Now that the groundwork has been laid I can present some specific defect prevention practices. This is not a comprehensive list – many other practices are possible. The practices I have chosen to discuss are ones that I have used and am confident work. </p>
<h3>Use Understood Methods Rule</h3>
<p>The basic formulation of the rule is quite simple: when coding a method only invoke other methods whose behavior you clearly understand and are confident will work the way you want. I have written a separate article providing <a href="http://www.basilv.com/psd/blog/2009/use-understood-methods-rule">specific guidance on how to apply this rule</a>.</p>
<p>This practice generally aims to prevent interface errors - which I define generally as defects between two separate pieces of code. Research suggests that a significant proportion of defects are due to these kinds of errors (see, for example, the paper <a href="http://users.ece.utexas.edu/~perry/work/papers/isnd.pdf">An Empirical Study of Software Interface Faults</a>).</p>
<p>I find this rule particularly valuable when applied to the invocation of methods between classes and especially between components. In this case it helps prevent integration errors which are usually not caught by unit testing.</p>
<h3>Design by Contract</h3>
<p>The idea of design by contract is to precisely specify the behavior of methods to help ensure that they are invoked correctly by callers and that the callers receive the results they are expecting. This is done by precisely specifying the preconditions and postconditions of methods. The chief proponent of design by contract is <a href="http://bertrandmeyer.com/">Bertrand Meyer</a>, whose book <a href="http://www.amazon.ca/gp/product/0136291554?ie=UTF8&#038;tag=basilvandegri-20&#038;linkCode=as2&#038;camp=15121&#038;creative=330641&#038;creativeASIN=0136291554">Object-Oriented Software Construction</a><img src="http://www.assoc-amazon.ca/e/ir?t=basilvandegri-20&#038;l=as2&#038;o=15&#038;a=0136291554" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> is a classic on the topic. </p>
<p>Method preconditions are conditions that must be true in order for the method to successfully execute and fulfill its postconditions. Preconditions are most commonly applied to method arguments. For example, a method to convert a string to a date might have the precondition that the string argument must not be null. Preconditions can also be applied to the state within the class or even external state. For example, a method to delete a particular object from the database might have the precondition that the object exists within the database.</p>
<p>Method postconditions are conditions that the method guarantees to be true after execution, assuming the preconditions are met. Postconditions are most commonly applied to the return value of methods, but like preconditions can also be applied to the state within the class or external state. Returning to the example of a method that converts a string to a date, such a method could have two postconditions. First, that the method will return a corresponding date object that is not null if the input is a string in a valid date format, and second that the method will throw a specified exception if the input does not correspond to a valid date format. </p>
<p>The combination of preconditions and postconditions forms in essence a contract between the method and its caller. The caller promises to fulfill the preconditions in exchange for the method guaranteeing that the postconditions will be met.</p>
<p>Despite the fact that the name of this practice contains the word "design", this approach does not require a separate up-front design of each method. The goal is to have a clear specification of behavior once the method is finished – how you arrive at it is not important to this practice. I tend to start with an initial idea for a method’s contract that I evolve as I write tests and implement the method’s logic using <a href="http://www.basilv.com/psd/blog/tag/test-driven-development">test-driven development</a>. </p>
<p>There are several options for specifying pre- and postconditions. Some teams rely solely on their automated unit tests to serve as the specification, but I prefer a more concise specification provided as part of the method definition. In Java I use JavaDoc to document pre- and postconditions and programmatically check argument preconditions at the start of the method. I formally specify pre- and postconditions only on methods that are intended for use by other classes or components. In Java, these are typically the public and protected methods of interfaces and classes.</p>
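<p>As a minimal sketch of what this can look like, here is the string-to-date example with its contract written in JavaDoc and the argument precondition checked at the start of the method. The class name <code>DateContracts</code>, the method name <code>parseIsoDate</code>, the ISO <code>yyyy-MM-dd</code> format, and the use of <code>java.time.LocalDate</code> (which requires Java 8 or later) are all assumptions made for illustration.</p>
<pre class="prettyprint">
import java.time.LocalDate;

final class DateContracts {

  private DateContracts() {
  }

  /**
   * Converts text in ISO yyyy-MM-dd format to a date.
   *
   * Precondition: text is not null.
   * Postcondition: returns a non-null LocalDate if text is a valid yyyy-MM-dd date.
   * Postcondition: throws java.time.format.DateTimeParseException otherwise.
   */
  static LocalDate parseIsoDate(String text) {
    // Check the argument precondition up front so a violation fails fast
    // with a clear message rather than deep inside the parsing logic.
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
    // LocalDate.parse never returns null; invalid input raises DateTimeParseException.
    return LocalDate.parse(text);
  }
}
</pre>
<p>A caller that honours the precondition can rely on both postconditions: it either receives a usable date or gets the documented exception, and never has to check for a null result.</p>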
<p>This practice is very closely related to the Use Understood Methods Rule, and they go hand-in-hand. Knowing a method’s pre- and postconditions is necessary to fully understand it. As I stated above, I tend to only apply formal design by contract to methods intended for use outside the class in question, which means this practice is really aimed at preventing integration defects.</p>
<h3>Defensive Coding</h3>
<p>Defensive coding is named after the practice of <a href="http://en.wikipedia.org/wiki/Defensive_driving">defensive driving</a> and is based on the same mindset of expecting problems to occur and actively taking precautions to avoid them. Defensive coding is applied by adopting a language-specific set of idioms that minimize or prevent common coding errors when using the language. These idioms are often reflected in coding standards.</p>
<p>Here are some examples of defensive coding idioms for Java:</p>
<ul>
<li>When checking whether a variable is equal to a constant, put the constant first. This avoids a potential null-pointer exception (if the variable is null) by invoking the equals() method on the constant, which is never null.
<pre class="prettyprint">
public boolean isAdmin(String userId) {
  String constant = "admin";
  return constant.equals(userId); // Instead of userId.equals(constant)
}
</pre>
</li>
<li>Always use braces to define a block of code for an if, else, while, for, or do statement, even if the block contains only a single line of code. This avoids the problem of later adding a second line of code indented to the same level as the first and mistakenly thinking it will be invoked as part of the block.
<pre class="prettyprint">
public void addOptions(String userId) {
  if (isAdmin(userId)) {
    addAdminOptions();
  } else {
    addRegularUserOptions();
  }
}
</pre>
</li>
<li>Use the Java 5 for-each construct rather than using a loop index variable to manually iterate through a list. This avoids the problem of having an off-by-one error in constructing the loop.
<pre class="prettyprint">
public void processOptions(List<Option> options) {
  for (Option option : options) {
    option.process();
  }
}
</pre>
</li>
</ul>
<p>Defensive coding aims to minimize coding errors, both at the time of coding and in the future when the code is being modified by others. While these types of errors are typically easily detected by unit testing, I find that using these idioms (after the initial adoption) takes virtually no effort or thought on my part, making them literally a no-brainer to use.</p>
<h3>Example-Based Requirements</h3>
<p>The idea behind this practice is to express requirements as much as possible in terms of concrete examples rather than the generalized wording typically used in use cases and lists of business rules, which is almost always ambiguous. I have written a separate article providing <a href="http://www.basilv.com/psd/blog/2010/example-based-requirements">further details on example-based requirements</a> which includes a specific example. :)</p>
<p>The practice of example-based requirements aims at minimizing requirement errors, particularly errors due to misunderstanding or misinterpreting. The examples should also be used as acceptance test cases, in which case they help detect design or coding errors (although unit tests should identify most of these first).</p>
<h3>Conclusion</h3>
<p>I encourage you to choose one or more of these practices to adopt in your current development work. There will be extra effort initially to understand and become comfortable with a given practice, but this will decline over time as you achieve mastery of it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2010/defect-prevention-practices/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Why Coding is not Enough</title>
		<link>http://www.basilv.com/psd/blog/2010/why-coding-is-not-enough</link>
		<comments>http://www.basilv.com/psd/blog/2010/why-coding-is-not-enough#comments</comments>
		<pubDate>Mon, 28 Jun 2010 13:30:29 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[quality]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/?p=522</guid>
		<description><![CDATA[If the goal of software development is to produce working software then developers need to know more than just how to code - they need to know how to prevent or eliminate functional and non-functional defects. Too many developers think their job is complete once a feature has been coded. Sometimes they think that it [...]]]></description>
			<content:encoded><![CDATA[<p>If the goal of software development is to produce <a href="http://www.basilv.com/psd/blog/2008/our-mission-as-software-developers">working software</a> then developers need to know more than just how to code - they need to know how to prevent or eliminate functional and non-functional defects.</p>
<p>Too many developers think their job is complete once a feature has been coded. Sometimes they think that it is the tester’s job to find defects. Sometimes they think defects in released code are unavoidable and normal, so not worth worrying about. Sometimes they believe their code is perfect - it cannot possibly have defects. I encounter developers with these attitudes with unfortunate frequency. I also encounter development managers who are surprised to encounter such attitudes. A while back I talked to one manager who was shocked to learn that one group of developers under her were assuming their code worked if it compiled successfully - there were no reviews or any sort of testing being done. So I hope with this article to raise the awareness amongst developers that coding is simply not enough to produce working software, and to raise the awareness amongst development managers that they need to ensure the appropriate systems are in place to support this.</p>
<p>The reality is that even the most diligent developers inject defects into their code at a surprisingly high rate. Defect rates are often expressed as the number of defects per thousand lines of code (KLOC). Industry statistics on defect rates are rather hard to find and vary significantly, partly because the definition of a defect varies. The book <a href="http://smartbear.com/codecollab-code-review-book.php">Best Kept Secrets of Peer Code Review</a> cites several studies with defect rates in the range of 10 to 100 defects per KLOC. This works out to one defect per 10 to 100 lines of code. </p>
<p>On my most recent project I decided to calculate the defect rate for a particularly error-prone feature. Counting only defects found by independent testers <em>after</em> code reviews and unit testing were done, and using a line count that excludes comments and blank lines, this feature had 20 defects for roughly 850 lines of code, which is a defect rate of about 24 defects per KLOC, or roughly one defect for every 40 lines of code. This may seem reasonable, but remember that this is after multiple code reviews and automated unit testing have already found and eliminated a number of defects. (How many I do not know as these kinds of defects are not tracked.) And there may be yet-to-be-found defects still lurking in this code. So the actual defect injection rate is higher, perhaps much higher. </p>
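<p>The arithmetic behind these figures can be sketched as follows. The class and method names are hypothetical and exist only to make the calculation explicit.</p>
<pre class="prettyprint">
final class DefectRate {

  private DefectRate() {
  }

  /** Returns the number of defects per thousand lines of code (KLOC). */
  static double defectsPerKloc(int defects, int linesOfCode) {
    if (defects < 0 || linesOfCode <= 0) {
      throw new IllegalArgumentException("defects must be >= 0 and linesOfCode > 0");
    }
    return defects * 1000.0 / linesOfCode;
  }

  /** Returns the average number of lines of code per defect. */
  static double linesPerDefect(int defects, int linesOfCode) {
    if (defects <= 0 || linesOfCode <= 0) {
      throw new IllegalArgumentException("defects and linesOfCode must be > 0");
    }
    return (double) linesOfCode / defects;
  }
}
</pre>
<p>For 20 defects in 850 lines, <code>defectsPerKloc(20, 850)</code> comes to about 23.5 and <code>linesPerDefect(20, 850)</code> to 42.5, which round to the figures quoted for this feature.</p>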
<p>Defect rates vary so widely, even between developers working on the same code base, that they are unfortunately not a reliable metric for predicting defect counts. My main point in discussing them is to emphasize just how frequently defects are introduced. </p>
<p>Coding, therefore, is simply not enough. Every developer needs to have a personal system for preventing and eliminating defects, which should integrate into the system / processes used by the development team to produce high-quality working software. For ideas on how to assemble such a system check out <a href="http://www.basilv.com/psd/blog/2009/my-definition-of-done">my definition of done</a> that identifies a number of defect elimination activities.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2010/why-coding-is-not-enough/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Why You Should Be Using FindBugs</title>
		<link>http://www.basilv.com/psd/blog/2009/why-you-should-be-using-findbugs</link>
		<comments>http://www.basilv.com/psd/blog/2009/why-you-should-be-using-findbugs#comments</comments>
		<pubDate>Mon, 02 Mar 2009 14:35:48 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[tools]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[FindBugs]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[quality]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/?p=284</guid>
		<description><![CDATA[Build automation has been the theme of my recent learning activities, so when I came across multiple positive references to a tool called FindBugs I decided to give it a try. My conclusion: FindBugs is worth using on all Java projects. Read below for the details. FindBugs is a Java static analysis tool that scans [...]]]></description>
			<content:encoded><![CDATA[<p>Build automation has been the theme of my recent <a href="http://www.basilv.com/psd/blog/2006/personal-learning-by-doing">learning activities</a>, so when I came across multiple positive references to a tool called <a href="http://findbugs.sourceforge.net/">FindBugs</a> I decided to give it a try. My conclusion: FindBugs is worth using on all Java projects. Read below for the details.</p>
<p>FindBugs is a Java static analysis tool that scans compiled Java code for potential defects and bad programming practices. Think of it as the Java compiler on steroids: it operates in roughly the same fashion but reports on a much larger set of errors and warnings. Static analysis makes a great complement to <a href="http://www.basilv.com/psd/blog/2009/java-unit-testing-tutorial">automated unit tests</a>. Unit tests require effort to write and target a specific piece of code, but can verify application-specific functionality. Static analysis requires no effort to write (beyond initial setup) and targets the entire code base, but can only verify general code constructs. At least, that's the theory. How does it work in practice?</p>
<p>FindBugs can be used in a number of different ways: from the command line, through a Swing GUI, as part of automated builds (e.g. via an Ant task and a Hudson plugin), and through an Eclipse plugin. I decided to go with the Eclipse plugin, and installation was as easy as adding the update site <a href="http://findbugs.cs.umd.edu/eclipse">http://findbugs.cs.umd.edu/eclipse</a> and installing it. Well, actually there were two gotchas. First, you need to be running Eclipse version 3.3 or greater for the plugin to work – RAD version 7 will not work. Second, you need to fully restart Eclipse after installing the plugin. I made the mistake of choosing the option to activate the plugin without doing a restart, which left portions of the plugin not working. It also seemed like you must bring up the Bug Explorer and Bug Details views, then restart Eclipse, in order to get those views working properly.</p>
<p>I used the <a href="http://findbugs.sourceforge.net/manual/index.html">FindBugs manual</a> to get started. I selected my <a href="http://www.basilv.com/psd/blog/2009/time-reporter-version-20-available">Time Reporter</a> project to be the first guinea pig and ran FindBugs on it. I keep my <a href="http://www.basilv.com/psd/blog/2007/why-you-should-polish-your-code">code well-polished</a> and well-tested with most Eclipse warnings turned on so I was not expecting FindBugs to turn up anything major. As I scanned the relatively small list of issues (under 30), I was surprised to see an actual defect! It actually took me a few moments of staring at it to find the problem. See the screen shot below.</p>
<p><a href="http://www.basilv.com/psd/wp-content/uploads/2009/02/findbugscaughterror.png"><img src="http://www.basilv.com/psd/wp-content/uploads/2009/02/findbugscaughterror.png" alt="" title="FindBugs finding a defect in Eclipse" class="alignnone size-medium wp-image-285" /></a></p>
<p>It turned out this same defect occurred elsewhere in the code. FindBugs also identified cases of bad error handling that I would classify as defects: I was improperly ignoring return values from method calls like <code>File.delete()</code> or <code>File.mkdir()</code>. Most of these serious issues were in test code rather than application code, which made me feel a little better (but not much).</p>
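<p>To illustrate this kind of warning, here is a minimal sketch of checking those return values rather than discarding them. The class and method names are hypothetical, and throwing <code>IOException</code> is just one reasonable way to surface the failure.</p>
<pre class="prettyprint">
import java.io.File;
import java.io.IOException;

final class FileOperations {

  private FileOperations() {
  }

  /** Creates the directory, failing loudly instead of ignoring mkdir()'s result. */
  static void createDirectory(File directory) throws IOException {
    // mkdir() returns false if the directory was not created,
    // including when it already exists.
    if (!directory.mkdir()) {
      throw new IOException("Could not create directory: " + directory);
    }
  }

  /** Deletes the file, failing loudly instead of ignoring delete()'s result. */
  static void deleteFile(File file) throws IOException {
    // delete() signals failure by returning false, so the result must be checked.
    if (!file.delete()) {
      throw new IOException("Could not delete: " + file);
    }
  }
}
</pre>
<p>In test code, a helper like this turns a silently skipped cleanup step into a visible test failure.</p>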
<p>I fixed all the issues reported by FindBugs that I agreed with, turned off one FindBugs warning I did not agree with at all, and was then left with a small number of false positives – incorrect warnings about code that was actually correct. My preferred approach to compiler warnings is to have none in the code base. This is based on the fact that if your code base has existing warnings that should be ignored then it becomes very difficult to tell when you write some code that produces a warning that should instead be fixed. People become blind to all warnings if there are always some present. Warnings should be produced as part of the developer's regular process (i.e. writing code) rather than requiring an extra step. It was trivial to configure FindBugs to run automatically, but how to get rid of the unwanted warnings? </p>
<p>Eclipse warnings can be eliminated by using the Java 5 annotation <code>@SuppressWarnings</code>, but FindBugs does not appear to support this annotation (or if it does I could not successfully determine what text must be supplied to the annotation to ignore the warning). I found some hints on the web that FindBugs does have the ability to ignore specific warnings, but it appears this feature has only been implemented in the Swing GUI and not in the Eclipse plugin. After further investigation I found a solution. I exported the current set of project warnings to an XML file, and then configured FindBugs to use that XML file as a baseline of warnings to ignore. This causes FindBugs to only report issues not in the baseline, leaving me with an empty list of FindBugs warnings – for now at least.</p>
<p>One annoying limitation of the FindBugs Eclipse plugin is that all configuration can only be done on a per-project basis. Unlike most other Eclipse functionality with global configuration that can be overridden on a per-project basis, FindBugs must be enabled and configured individually for each project. There were a few other wrinkles with the FindBugs Eclipse plugin. My general impression is that the plugin is still in a beta state, lagging behind the functionality offered by the FindBugs Swing GUI. </p>
<p>Next I tried including FindBugs as part of a continuous integration build running in <a href="https://hudson.dev.java.net/">Hudson</a>. It was fairly easy to configure the Ant build to execute FindBugs on the project: the only snag was needing to allocate more memory for the JVM. Installing and configuring the FindBugs plugin for Hudson was likewise straightforward, and resulted in a nice set of pages for viewing the trends and details regarding the FindBugs warnings. The big issue came when I wanted to filter out (exclude) all the warnings I did not want to fix. After much investigation and trial and error, I discovered that if I used the FindBugs Swing GUI to create an exclude list, I could then configure the Ant build to use that list as an exclude filter. Using the filter created by the Eclipse plugin did not seem to work. The FindBugs documentation concerning this was rather poor, but I did see a statement admitting that filter support was a bit messed up across the various FindBugs tools, and it sounds like this area will be targeted for improvement in the next year.</p>
<p>Despite these tooling problems, the ease with which FindBugs finds defects makes it definitely worthwhile to use on all Java projects. When I turned FindBugs loose against the code base of a reasonably-sized production application, it found 5 definite defects, 14 cases that were either potential defects or extremely bad coding style, and many other warnings (I didn't bother to count, but probably over 100) that were worth investigating and often worth fixing.</p>
<p>My conclusion from this is that using FindBugs is definitely worthwhile. I plan to roll it out to all my Java projects and integrate it into the automated builds so that the FindBugs results are also available from the continuous integration server. If you plan to adopt FindBugs then I recommend checking out some <a href="http://code.google.com/p/findbugs-tutorials/">FindBugs tutorials</a>. If you want to promote the use of FindBugs to your coworkers or management then I'll point out that FindBugs is a corporate standard at both Google and eBay, and that eBay reports that "using 2 developers to audit/review FindBugs warnings was 10 times more effective at finding P1 bugs than using two testers" (<a href="http://findbugs-tutorials.googlecode.com/files/UFIA-intro.pdf">http://findbugs-tutorials.googlecode.com/files/UFIA-intro.pdf</a>).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2009/why-you-should-be-using-findbugs/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Errors Errors Everywhere</title>
		<link>http://www.basilv.com/psd/blog/2007/errors-errors-everywhere</link>
		<comments>http://www.basilv.com/psd/blog/2007/errors-errors-everywhere#comments</comments>
		<pubDate>Mon, 11 Jun 2007 14:22:30 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[maintenance]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[error handling]]></category>
		<category><![CDATA[software development]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/blog/2007/errors-errors-everywhere</guid>
		<description><![CDATA[If you are a software developer and have not maintained operational applications with real users hammering away at it, then you are missing some important lessons. You might not fully appreciate the operational challenges facing the maintenance and support team, particularly when the software in question is suffering in the areas of reliability, performance, or [...]]]></description>
			<content:encoded><![CDATA[<p>If you are a software developer and have not maintained operational applications with real users hammering away at them, then you are missing some important lessons. You might not fully appreciate the operational challenges facing the maintenance and support team, particularly when the software in question is suffering in the areas of reliability, performance, or capacity. Over my period of involvement with application maintenance, I have been amazed at the number of incidents and problems that arise when an application goes into production. That is why in the last few months I have written several articles about reliability such as <a href="http://www.basilv.com/psd/blog/2007/error-handling-and-reliability">Error Handling and Reliability</a>.  </p>
<p>Based on my experience, I thought I had a good appreciation for what can go wrong. That changed recently when I experienced a day filled with too many problems and errors to be believed. The day started off innocently enough until a member of the production support team came by to inform us that he had accidentally terminated the database connection of one of our batch jobs due to transposing two numbers in the identifier. This by fluke matched our job instead of the one he wanted. Okay, no problem, we simply need to confirm that nothing was corrupted and restart the process. We checked our email for the notification email that is sent when a batch job abnormally terminates in order to verify which job had been affected. No such email was found. A little puzzled, we checked the server and confirmed that the process was no longer running. But another batch process was executing, and we identified it as a subsequent batch job dependent on the first. Subsequent jobs only run if the predecessors execute successfully, so we had a sinking feeling as we started checking the log files. Sure enough, due to a complete lack of error handling, the first job had reported a successful execution despite the database connection failure, which had caused the second job to start. That explained the lack of a notification email. The second job depended on the processing results of the first job, so the output of the second job was suspect and likely wrong. We had to kill the second job. If the first job had just failed, we could have restarted it without a problem, but now we had to investigate how to undo the effects of both jobs and manually restart the first. </p>
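<p>The failure mode in this story, a job that reports success even though its work failed, can be avoided along the following lines. This is an illustrative sketch and not the actual batch code: the <code>Job</code> interface, the <code>BatchJobRunner</code> class, and the exit-status convention (0 for success, non-zero for failure) are all assumptions.</p>
<pre class="prettyprint">
interface Job {
  void execute() throws Exception;
}

final class BatchJobRunner {

  static final int SUCCESS = 0;
  static final int FAILURE = 1;

  private BatchJobRunner() {
  }

  /**
   * Runs the job and returns an exit status that tells the scheduler
   * whether dependent jobs may start.
   */
  static int run(Job job) {
    try {
      job.execute();
      return SUCCESS;
    } catch (Exception e) {
      // Do not swallow the failure: log it and report a non-zero status so
      // that successor jobs are held back and a notification can be sent.
      System.err.println("Batch job failed: " + e);
      return FAILURE;
    }
  }
}
</pre>
<p>A launcher's <code>main</code> method would pass the returned status to <code>System.exit()</code>, so that a scheduler which starts successors only on a zero exit status would not have started the second job.</p>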
<p>Well, that didn't seem too bad, until I had time to think for a second. That is when I realized that our batch jobs are always scheduled at night or on weekends, and never during business hours. What was one doing running during the day? That prompted another investigation, which revealed that the job does normally run on weekends. But the previous weekend there were problems with predecessors to this job that caused it to be delayed until one of the nights during the week. So why didn't it run at night? We were surprised to discover that it had – it just didn't finish. Due to performance issues, the job had run for over eight hours, extending into the day, before it was killed by mistake.</p>
<p>By late afternoon of that day we had multiple investigations underway trying to track down the various <a href="http://www.basilv.com/psd/blog/2006/how-to-do-root-cause-analysis">root causes</a> of the problems we had identified. My mind reached the saturation point sometime in the afternoon, so I cannot remember the details concerning what was found. I suspect there were other problems unearthed that I have since forgotten. Nor were we able to get everything fixed that same day. That combination of problems, coming together on that one day, kept a surprising number of people busy for days sorting out the mess.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2007/errors-errors-everywhere/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Examples of Root Cause Analysis</title>
		<link>http://www.basilv.com/psd/blog/2006/examples-of-root-cause-analysis</link>
		<comments>http://www.basilv.com/psd/blog/2006/examples-of-root-cause-analysis#comments</comments>
		<pubDate>Thu, 03 Aug 2006 15:00:27 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[professional]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[learning]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/blog/2006/examples-of-root-cause-analysis</guid>
		<description><![CDATA[This article is a continuation of my previous article on how to do root cause analysis . As I promised, this article provides examples of root cause analysis being performed. A famous example of root cause analysis is the presidential commission's inquiry into the 1986 US Challenger space shuttle explosion, particularly the observations of Nobel [...]]]></description>
			<content:encoded><![CDATA[<p>This article is a continuation of my previous article on <a href="http://www.basilv.com/psd/blog/2006/how-to-do-root-cause-analysis">how to do root cause analysis</a>. As I promised, this article provides examples of root cause analysis being performed.</p>
<p>A famous example of root cause analysis is <a href="http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/table-of-contents.html">the presidential commission's inquiry into the 1986 US Challenger space shuttle explosion</a>, particularly <a href="http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Appendix-F.txt">the observations of Nobel prize-winning physicist Richard Feynman</a>. The basic finding of this investigation was that the explosion was caused by the failure of the O-ring seal in the right solid rocket motor due in part to unusually cold temperatures. They identified the problem, but did they find the root cause? The report does have a short section on the contributing cause of the accident. </p>
<p>Contrast this with Feynman's observations, which go well beyond the O-ring seal and focus on deeper issues such as how NASA management evaluated shuttle reliability and safety. It was the solid rocket booster that failed, yet Feynman also investigated other major shuttle components. Feynman probed to the heart of the matter - the root cause - by not accepting limits on his <em>why</em> questions. He found that the engineers' warnings that it was too cold to safely launch went unheeded by management. His investigation was not without political consequences. Feynman's observations almost didn't make it into the final inquiry report - he had to fight to have them included - and they were relegated to an appendix. Feynman's final statement in his report elegantly summed up the root cause: "For a successful technology, reality must take precedence over public relations, for nature cannot be fooled." </p>
<p>My other example of root cause analysis comes from my own experience on a maintenance team for an operational business application. End users of this application had discovered bad data in one of the database tables in the production system. Other people on the team looked into the problem and determined that it was caused by a missing database trigger. Not missing as in forgotten to be added originally, but missing as in the trigger existed at some point in the past, but no longer did. When I learned of this situation, I started my root cause analysis by asking <em>why</em> the trigger had disappeared. Naturally I didn't get an answer, unless you count "I don't know". It was time to start investigating.</p>
<p>Triggers don't disappear by themselves. Someone had made a change to the database schema that eliminated the trigger. I doubted that someone would have explicitly deleted the trigger, if only because everyone was surprised it was gone. So it was a mistake - some other change that inadvertently eliminated the trigger. My chief suspect was a change to the underlying table. If the table was dropped and recreated for some modification, then dependent objects such as triggers would have been automatically dropped and would have needed to be recreated as part of the change. Of course, the database administrators (DBAs) on the team who make all the database changes know this. So <em>why</em> then would the trigger not be recreated?</p>
<p>I needed to find out how the DBAs normally performed table changes. A few questions later, I learned that the typical approach was to use their DBA tool to extract the DDL definition for the table and all related database objects (views, indexes, etc.), make the required changes to this DDL, then run it. I then tried this procedure for myself, selecting a table with a trigger. I used the DBA tool to extract the DDL. To my surprise, the resulting DDL did not include the trigger definition. This meant that I had found the probable root cause. While I didn't definitively know that this caused the problem, I knew it was very likely. More importantly, it was an issue that could be addressed to minimize the likelihood of triggers being dropped in the future. I notified the DBAs about my findings and this defect in the DBA tool was submitted to the company that developed it.</p>
<p>While I had a likely root cause, this didn't mean I was done with the root cause analysis. There were still more <em>why</em> questions to ask. <em>Why</em> wasn't this problem noticed sooner by the maintenance team before the change went into the production system? Was a proper review performed of the change? <em>Why</em> didn't system testing or user testing detect the missing trigger? Were any other triggers missing on production database tables? For each of these questions there is the potential for an answer that will identify how to help prevent this type of problem from reoccurring.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2006/examples-of-root-cause-analysis/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>How to do Root Cause Analysis</title>
		<link>http://www.basilv.com/psd/blog/2006/how-to-do-root-cause-analysis</link>
		<comments>http://www.basilv.com/psd/blog/2006/how-to-do-root-cause-analysis#comments</comments>
		<pubDate>Thu, 27 Jul 2006 15:00:52 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[professional]]></category>
		<category><![CDATA[continuous improvement]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[learning]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/blog/2006/how-to-do-root-cause-analysis</guid>
		<description><![CDATA[Root cause analysis is an important activity whenever a problem occurs - whether it is a defect, an operational outage, or something else. Whatever the problem, your objective should be to not only resolve the issue but also prevent it from reoccurring in the future. To do this, you need to determine the root cause [...]]]></description>
			<content:encoded><![CDATA[<p>Root cause analysis is an important activity whenever a problem occurs - whether it is a defect, an operational outage, or something else. Whatever the problem, your objective should be to not only resolve the issue but also prevent it from reoccurring in the future. To do this, you need to determine the root cause - the key factor(s) that caused the problem to occur and that need to change in order to stop it from happening again. </p>
<p>Root cause analysis is essentially a learning exercise, and as such should be a fundamental practice if your objective is <a href="http://www.basilv.com/psd/blog/2006/perpetual-learning">perpetual learning</a> or <a href="http://www.basilv.com/psd/blog/2006/continuous-improvement">continuous improvement</a>. It disappoints me to so often see it done poorly or not at all. The good news is that the ability to do root cause analysis can be trained and improved.</p>
<p>At its essence, root cause analysis involves asking "Why?" coupled with the determination to find answers that will help permanently resolve or at least improve the situation being dealt with. To start, you simply ask "Why did the problem happen?". If the answer tells you how to fix the problem but not how to prevent it in the future, then you need to keep asking why. Even once you start getting more profound answers, you can often continue asking why questions and learn still more.</p>
<p>Asking why is easy - figuring out the answer is hard. Both analytical and creative thinking skills play a role. At times you may feel like a detective, ferreting out clues like Sherlock Holmes. When faced with a question concerning a situation with no apparent answer, I've found it helpful to brainstorm ideas for what possibly could have led to the situation, and from there try to determine whether each possibility could have occurred.</p>
<p>Root cause analysis is not for the faint of heart. Asking probing questions and searching for answers about why things went wrong can make you unpopular, especially if your investigation involves other teams and other managers. You need to be careful that your efforts are not perceived as laying blame or trying to score political points by making others look bad. Instead, emphasize that it is a learning exercise whose purpose is to prevent future occurrences of the problem.</p>
<p>Next week I'll present some real-life <a href="http://www.basilv.com/psd/blog/2006/examples-of-root-cause-analysis">examples of root cause analysis</a> that touch on the above points. But you don't have to wait. You can start practicing root cause analysis today - just ask "Why?".</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2006/how-to-do-root-cause-analysis/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>My Defect Fixing Process</title>
		<link>http://www.basilv.com/psd/blog/2006/my-defect-fixing-process</link>
		<comments>http://www.basilv.com/psd/blog/2006/my-defect-fixing-process#comments</comments>
		<pubDate>Thu, 22 Jun 2006 15:00:15 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[maintenance]]></category>
		<category><![CDATA[defects]]></category>
		<category><![CDATA[learning]]></category>
		<category><![CDATA[process]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/blog/2006/my-defect-fixing-process</guid>
		<description><![CDATA[What's your process for fixing a defect? What do you do when you are informed that a feature you developed isn't working to the users' satisfaction, or even worse fails to work at all? Here's what I do. Initial investigation. My goal is to reproduce the reported problem in the application in my development environment. [...]]]></description>
			<content:encoded><![CDATA[<p>What's your process for fixing a defect? What do you do when you are informed that a feature you developed isn't working to the users' satisfaction, or even worse fails to work at all? Here's what I do.</p>
<ol>
<li><em>Initial investigation.</em> My goal is to reproduce the reported problem in the application in my development environment. This may require obtaining more information from the user(s). If I can't reproduce it, I try to reason about possible causes to help me figure out what conditions are necessary to reproduce it. In hard cases, I may need to add extra error-checking or error-handling code to try to narrow down what is going on when the problem happens again.</li>
<li><em>Classify the problem.</em> Many problems are defects: a flaw in the code or design that prevents the application from working as designed. Sometimes the user may be having trouble with the user interface - I classify this as a usability defect. Or the existing functionality may be too limited and the user is requesting something new - this I classify as an enhancement request. The distinction between defect, usability defect and enhancement is quite blurry. Often I will treat a usability defect like any other defect that needs to be fixed, but if such a defect requires significant user interface changes, then I classify it as an enhancement. If I determine that the reported problem is really an enhancement request, then I am finished with my defect fixing process (enhancements are typically handled through a different process).</li>
<li><em>Analyze the cause.</em> My goal is to write an automated unit test that proves the existence of the defect by failing - i.e. I'd expect the test to pass if the defect didn't exist. I may need to do some analysis to determine what section of code is failing, and may need to iterate on this a few times if the defect is especially hard to pin down. For certain problems, including most usability defects, it is too difficult to write an effective test so I rely on manual testing and debugging.</li>
<li><em>Fix the defect.</em> Now that I know the cause of the defect, I can fix it. This is where having a test for the defect really pays off, since I can run the tests after making my fix and verify that the problem is corrected by having the tests pass.</li>
<li><em>Learn from the defect.</em> The defect may be fixed, but I'm not done. Each defect is a learning opportunity; it represents a failure in something I've done that I can correct and improve. So I ask myself questions like the following. Why or how was this defect introduced into the code? Was it an error in design, a copy-and-paste error, a failure to consider special conditions like a null return value, an improperly understood library, etc.? Why didn't the unit tests catch this problem? Why didn't reviews or manual testing catch this problem? Would similar defects exist elsewhere in the code, due to the same or a related failure that caused this problem? What can I do to prevent this type of defect from happening again?</li>
<li><em>Act on my learning.</em> Based on what I've learned from this defect, I take action to improve myself, the application, and the team's processes for the future. Possible actions include writing more unit tests for the problematic section of code, refactoring a difficult-to-use API to eliminate the problem that occurred, informing other developers of why this defect happened, and improving error handling to make it easier to reproduce similar defects in the future.</li>
</ol>
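<p>As a rough illustration of the "analyze the cause" and "fix the defect" steps above, here is a minimal sketch in Python. Everything in it is hypothetical and invented for the example: the <code>find_nickname</code> lookup, the two <code>greeting</code> functions, and the "Guest" fallback. The defect shown is the kind of unhandled null return value mentioned in the list; the test is written to fail against the defective code and pass once the fix is in place.</p>

```python
import unittest


def find_nickname(nicknames, user_id):
    """Hypothetical lookup: returns None when the user has no nickname."""
    return nicknames.get(user_id)


def greeting_buggy(nicknames, user_id):
    # Defective version: forgets that the lookup may return None,
    # so it raises AttributeError for users without a nickname.
    return "Hello, " + find_nickname(nicknames, user_id).title()


def greeting_fixed(nicknames, user_id):
    # Fixed version: falls back to a generic name when no nickname exists.
    nickname = find_nickname(nicknames, user_id)
    if nickname is None:
        return "Hello, Guest"
    return "Hello, " + nickname.title()


class GreetingRegressionTest(unittest.TestCase):
    # Point this at greeting_buggy to watch the first test fail,
    # which is what proves the defect exists before fixing it.
    greeting = staticmethod(greeting_fixed)

    def test_user_without_nickname_gets_generic_greeting(self):
        self.assertEqual(self.greeting({}, 42), "Hello, Guest")

    def test_user_with_nickname_is_greeted_by_name(self):
        self.assertEqual(self.greeting({7: "basil"}, 7), "Hello, Basil")


if __name__ == "__main__":
    unittest.main()
```

<p>Keeping the test in the suite after the fix is one way to act on the learning: if the same mistake is reintroduced later, the test fails again.</p>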
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2006/my-defect-fixing-process/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

