
Predicting and Evaluating Defect Levels

Is it possible to predict how many defects will be encountered in acceptance test or production? What number of defects would be considered reasonable versus a sign of low or high quality? These are questions I considered when my last project entered acceptance test. At the time I had no good answers, so over the past months I have been searching for information on defect levels and quality metrics that could help answer these questions. The most useful source I have found is Capers Jones, a researcher and consultant on formal software estimation.
Jones has written a number of articles and books in which he provides metrics on defect levels based on benchmarks derived from literally thousands of projects. The ones I found most useful were:

The information I provide in the remainder of this article comes primarily from these references.

Estimating Software Size

The first step in Jones's approach is to estimate the size of the software. His preferred size metric is function points, a language- and technology-neutral measure of business functionality based primarily on an assessment of program inputs, program outputs, and data storage. Another common metric is logical source lines of code (SLOC); KLOC denotes 1,000 lines of code.

Lines of code are a convenient metric in that they can be measured automatically with a tool, whereas function points require a trained function point counter. Nevertheless, function points are superior in several key ways:

  • Roughly fifty percent of defects are due to problems with requirements or design rather than coding, and for those defects, metrics expressed in lines of code make little sense. In particular, different implementations of the same functionality can vary in the lines of code required by up to a factor of four. A longer implementation will have fewer requirements or design defects per KLOC and thus artificially appear to have higher quality, when in reality the code is simply bloated. (This is related to the issue of measuring developer productivity by lines of code produced.)
  • The use of multiple languages complicates source code counts, and multi-language projects are far more common than one might expect. Even a simple web application typically includes JavaScript, HTML, CSS, and perhaps SQL in addition to the primary language (e.g. Java). Function points are language-independent.
  • Function points can be determined once the design is known, whereas lines of code cannot be counted until coding is complete. This makes function points useful for planning and estimation much earlier in a project.

Since Jones prefers function points, the defect metrics he provides in his writings are typically expressed in that measure. While I understood the reasons, I found these metrics difficult to apply because I had no idea what the function point counts were for the software I was working on. So I was pleased to discover that benchmarks such as this one exist to convert between function points and lines of code for various languages. One function point corresponds on average to roughly 50 logical lines of Java code; I use this conversion factor in the sections below.
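
To make the conversion concrete, here is a minimal Java sketch of the function point / lines of code conversion, assuming the rough 50 LOC per function point benchmark above (the class and method names are my own):

    // A minimal sketch of the function point <-> lines of code conversion,
    // using the rough benchmark of 50 logical Java LOC per function point.
    public class SizeConversion {
        static final double JAVA_LOC_PER_FUNCTION_POINT = 50.0;

        static double functionPointsToLoc(double functionPoints) {
            return functionPoints * JAVA_LOC_PER_FUNCTION_POINT;
        }

        static double locToFunctionPoints(double linesOfCode) {
            return linesOfCode / JAVA_LOC_PER_FUNCTION_POINT;
        }

        public static void main(String[] args) {
            // A 25 KLOC Java application is roughly 500 function points.
            System.out.printf("25,000 LOC ~ %.0f function points%n",
                    locToFunctionPoints(25_000));
        }
    }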

Defect Potential

All software has the potential to contain defects. Defect potential is a measure of the expected number of defects in a particular piece of software; it is also called the injection rate, the number of defects introduced throughout development. The primary factor determining the number of defects is the size of the application in function points. The maturity of the development team (its experience, skill, and attention to quality) is another key factor. The following table specifies how to calculate the expected number of defects given these two factors.

Maturity Level          Defects / Function Point    Lines of Code / Defect
Worst Organizations     9                           6
Average Organizations   5                           10
Best Organizations      2                           25
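
As a rough sketch of how the table can be applied, the following Java snippet (the names and structure are my own invention) computes the expected defect count from size and team maturity using the defects-per-function-point column:

    // A sketch of the table above: expected defect count from size and maturity.
    public class DefectPotential {
        enum Maturity {
            WORST(9.0), AVERAGE(5.0), BEST(2.0); // defects per function point

            final double defectsPerFunctionPoint;
            Maturity(double defectsPerFunctionPoint) {
                this.defectsPerFunctionPoint = defectsPerFunctionPoint;
            }
        }

        static double expectedDefects(double functionPoints, Maturity maturity) {
            return functionPoints * maturity.defectsPerFunctionPoint;
        }

        public static void main(String[] args) {
            // A 500 function point application built by an average team:
            System.out.printf("Expected defects: %.0f%n",
                    expectedDefects(500, Maturity.AVERAGE)); // prints 2500
        }
    }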

Some key defect prevention activities contributing to reduced defect potential are:

  • Close customer collaboration during requirements / design (e.g. JAD sessions)
  • Prototyping
  • Feedback / learning from design and code reviews

The above metrics assume a linear relationship between defects and size, but as a system gets larger there are typically more interactions between its pieces and more complexity, and thus a greater likelihood of defects than a linear increase would suggest. For very large systems, a more accurate estimate is: number of defects = (function points)^exponent. For average organizations, use 1.25 as the exponent. (Good organizations can lower this to 1.15, while poor-performing organizations see it elevated to 1.35.)
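
To show how quickly the power rule outpaces the linear rate, here is a small, illustrative Java calculation (the numbers follow the exponents above; the class name is hypothetical):

    // The non-linear rule: total defects = functionPoints ^ exponent.
    public class PowerRulePotential {
        public static void main(String[] args) {
            double functionPoints = 10_000; // a very large system
            double exponent = 1.25;         // average organization

            double powerRule = Math.pow(functionPoints, exponent); // ~100,000 defects
            double linearRule = functionPoints * 5.0;              // 50,000 at 5 per FP

            System.out.printf("Power rule: %.0f, linear rule: %.0f%n",
                    powerRule, linearRule);
        }
    }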

Defects can be categorized by origin - the type of activity that produced the defect. The table below shows this breakdown for average organizations.

Defect Origin    Defects / Function Point    Lines of Code / Defect    Percentage of Total
Requirements     1                           50                        20%
Design           1.25                        40                        25%
Coding           1.75                        29                        35%
Document         0.6                         83                        12%
Bad Fixes        0.4                         125                       8%

Defect Removal

Defect removal is the identification and elimination of defects after they are introduced. The cumulative defect removal rate, or defect removal efficiency, of a development project is calculated as the number of defects eliminated prior to release to production divided by the total number of defects, where the total includes the defects found in the first 90 days of production use.
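
Here is a small, illustrative Java helper for this calculation (the method name is my own); it assumes the total is pre-release removals plus defects reported in the first 90 days:

    // Defect removal efficiency: pre-release removals divided by the total,
    // where the total adds defects reported in the first 90 days of use.
    public class RemovalEfficiency {
        static double removalEfficiency(int removedBeforeRelease, int foundInFirst90Days) {
            return (double) removedBeforeRelease
                    / (removedBeforeRelease + foundInFirst90Days);
        }

        public static void main(String[] args) {
            // 2375 defects removed before release, 125 reported in 90 days -> 95%.
            System.out.printf("DRE = %.0f%%%n",
                    100 * removalEfficiency(2375, 125));
        }
    }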

The following table shows how the defect removal rate varies with the maturity level of the team, just as defect potential does, along with the expected number of post-release defects implied by combining the defect potential and removal metrics.

Maturity Level          Defect Removal Rate    Post-Release Defects / Function Point    Lines of Code / Post-Release Defect
Worst Organizations     60%                    3.6                                      13
Average Organizations   85%                    0.75                                     67
Best Organizations      95%                    0.1                                      500

Removal efficiency varies for defects of different origins, as the following table shows using statistics for average organizations.

Defect Origin    Defect Removal Efficiency
Requirements     77%
Design           85%
Coding           95%
Document         80%
Bad Fixes        70%

Quality control procedures such as testing and reviews (inspections) vary in their effectiveness at removing defects as illustrated in the following table.

Quality Activity                      Average Defect Removal Rate    Peak Defect Removal Rate
Requirements review                   30%                            50%
Design review                         40%                            65%
Personal review (design or code)      35%                            60%
Code reviews or pair programming      50%                            70%
Unit testing (automated or manual)    25%                            50%
Functional testing                    30%                            45%
Regression testing                    20%                            30%
Performance testing                   15%                            25%
System testing                        35%                            50%
Acceptance testing                    30%                            45%

Peak defect removal rates for a given activity are typically achieved only by skilled, experienced staff who take a rigorous, disciplined approach to performing the activity. As an example, consider unit testing. As normally performed, it has a 25% defect removal rate. But an experienced developer following test-driven development will achieve nearly 100% code coverage and typically write better tests, reaching the higher 50% removal rate.

Predicting Defect Levels

The overall or cumulative defect removal rate for a development effort can be calculated by aggregating the individual defect removal rates of the quality control procedures used by the team: cumulative removal rate = 1 - the product across all procedures of (1 - individual removal rate).

For example, if a team uses only unit testing (25% removal), functional testing (30%), and regression testing (20%), then the cumulative rate = 1 - (1-0.25) * (1-0.30) * (1-0.20) = 0.58 or 58%.
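
The same aggregation, expressed as a minimal Java sketch (again, the names are my own):

    // Cumulative removal rate = 1 - product of (1 - rate) over all activities.
    public class CumulativeRemoval {
        static double cumulativeRemovalRate(double... activityRates) {
            double escapeProbability = 1.0; // fraction slipping past every activity
            for (double rate : activityRates) {
                escapeProbability *= (1.0 - rate);
            }
            return 1.0 - escapeProbability;
        }

        public static void main(String[] args) {
            // Unit (25%), functional (30%), and regression (20%) testing:
            System.out.printf("Cumulative removal rate: %.0f%%%n",
                    100 * cumulativeRemovalRate(0.25, 0.30, 0.20)); // 58%
        }
    }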

The overall number of defects in production (or UAT) can be calculated by using the defect potential of the team to determine the expected number of defects introduced, and the cumulative defect removal rate to determine the number of defects remaining. Approximately 25% of the remaining defects can be expected to be high severity.

For example, an average organization injecting 1 defect per 10 lines of code will introduce a total of 2,500 defects into a 25 KLOC application. Given a cumulative defect removal rate of 95%, 2,375 of those defects will be found and removed before release, leaving 125 defects in production, of which roughly 31 can be expected to be high severity.
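
Putting the pieces together, here is a hypothetical Java sketch reproducing this worked example:

    // End-to-end prediction for the 25 KLOC example above.
    public class DefectPrediction {
        public static void main(String[] args) {
            double linesOfCode = 25_000;
            double locPerDefect = 10;           // average organization injection rate
            double cumulativeRemoval = 0.95;    // cumulative defect removal rate
            double highSeverityFraction = 0.25; // share of remaining defects

            double injected = linesOfCode / locPerDefect;           // 2500
            double removed = injected * cumulativeRemoval;          // 2375
            double remaining = injected - removed;                  // 125
            double highSeverity = remaining * highSeverityFraction; // ~31

            System.out.printf(
                    "Injected %.0f, removed %.0f, remaining %.0f (%.0f high severity)%n",
                    injected, removed, remaining, highSeverity);
        }
    }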

If you find this article helpful, please make a donation.
