
When is Testing Done?

I have been asked several times recently when testing can be considered 'done' for a piece of software. A related form of this question is to ask when one should stop testing. This applies to both developers and testers, for any type of testing ranging from writing automated unit tests to user acceptance testing. I like posing this question to others as a means of stimulating reflection and discussion. So before you read the remainder of this article and see my answer, please stop and take a minute to think about how you would answer it.

(Did you stop and think?)

In a philosophical sense, I could argue that testing is never truly done. Especially if testing is defined using James Bach's broad definition of questioning a product in order to evaluate it, I could easily create examples of testing happening days or weeks after the examination of the product has finished. Even ignoring this broad definition and using a narrower definition of executing software to verify that results match expectations, there are still an essentially infinite number of possible tests that can be performed on even the most trivial of software, so there is no way to ever be fully done. These philosophical points, however, do not match the intent of my question, which is really about the allocation of effort towards testing. When should you, as a developer or tester, stop putting effort towards testing a piece of software and consider it done? This is essentially elaborating upon a definition of done for testing.

My answer to this question is based upon the specific goals or objectives for the testing being performed. In my experience there are usually two primary goals of testing:

  1. Find defects.
  2. Assess whether the software is ready to be promoted / released.

Testing can have other goals - for a fuller discussion, see the article What is a Good Test Case? (pdf) by Cem Kaner.

So the simplistic but essential answer to my question is that testing is done when its objectives have been achieved. More specifically, you are done testing when:

  1. You are unlikely to find additional defects.
  2. You have a sufficiently high level of confidence that the software is ready to be promoted / released.

These answers are rather brief and do not provide much in the way of actionable guidance, so I expand on them below. Before I do so, however, I must address another factor: project constraints of budget, schedule, and resource availability. If you run out of budget or time, you generally need to stop testing. This does not mean, however, that testing should be considered done. On the other hand, testing for an indefinite period of time, oblivious to cost or duration, does not seem appropriate either. Ideally, a balanced approach is taken: the primary testing goals are modified to incorporate the notion of employing a reasonable level of effort (cost) and duration given the desired quality level. For example, life-critical software demands much higher quality, and thus much greater effort towards testing, than software intended for casual, personal use.

Defining Done: Finding Defects

If your goal in testing is to find defects, then ideally you should stop testing after you have found all the defects. Unfortunately, there are several flaws with this theory:

  • Testing is very unlikely to find certain types of defects. Even the combination of several different styles of testing is unlikely to find more than 75% of the total defects. Each individual type of testing, even when carefully executed with a high degree of skill, is unlikely to find more than 50% of the defects.
  • Testing is relatively inefficient at finding defects. Even if you could eventually find all of them, you would likely spend a very, very long time doing so, especially on the last few.
  • You have no way of knowing in advance if all the defects have been found, or whether more remain in the system.

So in practice you need to abandon the idea of finding all the defects and use a different approach. Instead, evaluate the likelihood of finding more defects based on the effort you have already put in and the results you have obtained so far. If this likelihood is too low, then you are done testing: additional testing would not provide a sufficient return on investment in terms of new defects found compared to the effort expended. Some points to consider when making this evaluation are listed below, followed by a rough sketch of what such a stopping rule might look like:

  • If you have just found a defect, this is a signal to keep testing. It may seem counter-intuitive, but in general the more defects you find, the more likely it is that there are additional defects.
  • If you have only exercised a small portion of the overall functionality and already found defects, then this is a signal to continue testing.
  • If you have been testing a particular piece of functionality for a while and are not finding new defects, then this is a signal for you to stop testing.
  • If you are struggling to come up with new tests that are meaningful, then this is a signal that you are done. Another sign is when the defects you do find have low relevance to users and the decision is made not to fix them.
  • If the tests you are performing are becoming more and more complicated and taking significantly more effort, but you are only occasionally finding defects, then this is a signal to stop.
  • When testing a larger set of functionality like an entire application, the question of whether to stop testing can and should be applied at the level of individual features or components. One reason for this is that defects tend to cluster. This commonly leads to a system having a few error-prone components that account for up to 80% of the total number of defects, while other components can be of significantly higher quality. Once you have identified a component that appears error-prone, focusing additional testing on it is quite likely to find more defects.
  • When testing a large system, you should revisit the decision to stop testing a particular area as you gain more information. For example, imagine you have three features of a system to test for the first time. You start with feature one and initially find a number of defects without much work, but soon the defects become much harder to find, so you stop and switch to feature two. There you find no defects at all, so after a shorter period you move on to feature three, where you find the occasional defect but eventually struggle to create meaningful tests. Since feature one seemed to have the highest defect density, you return to it, but after a short while and only one new defect you stop again. At this point you realize that feature two has received the least testing and that there are parts of its functionality you have not tried, so you return to it and find one more defect, but no others. Testing feature two happens to give you ideas for new meaningful tests for feature three, so you return to it as well, but find no new defects there either. At this point you decide you are done testing all three features.
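
To make the 'return on investment' idea above a bit more concrete, here is a minimal sketch, in Python, of one way a rate-based stopping heuristic could be expressed. It is an illustration rather than a recommendation: the session structure, the three-session window, and the threshold of 0.2 defects per hour are all assumptions that would need to be calibrated to your own project and quality goals.

    from dataclasses import dataclass

    @dataclass
    class Session:
        hours: float    # effort spent in this testing session
        defects: int    # defects found during this session

    def likely_done(sessions, min_sessions=3, threshold=0.2):
        """Rough stopping heuristic: consider testing done when the recent
        defect discovery rate (defects per hour over the last few sessions)
        falls below a context-specific threshold. All numbers are assumptions."""
        if len(sessions) < min_sessions:
            return False  # too little data to judge yet
        recent = sessions[-min_sessions:]
        hours = sum(s.hours for s in recent)
        defects = sum(s.defects for s in recent)
        return hours > 0 and (defects / hours) < threshold

    # Example: early sessions find many defects, later ones almost none.
    history = [Session(4, 6), Session(4, 3), Session(4, 1), Session(4, 0), Session(4, 0)]
    print(likely_done(history))  # True - roughly 0.08 defects per hour recently

A rule like this is only one input to the decision; the qualitative signals in the list above, such as defect clustering and the difficulty of devising meaningful new tests, matter at least as much.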

Defining Done: Assessing Readiness

If your goal in testing is to assess whether the software is ready to be promoted to the next level of testing or released into production, then you are done testing once you have obtained a sufficiently high level of confidence in your assessment. If your assessment is a 'no-go' - do not proceed with the promotion / release - then it is likely that the issues you found will need to be fixed, triggering additional testing in order to obtain your 'go' recommendation. (This assumes, of course, that there is sufficient budget and schedule for additional testing, and also that the relevant decision maker(s) care about quality and pay attention to your assessment. If this is not the case, then you perhaps should not make assessing readiness a testing goal.)

Factors to consider in performing this assessment are listed below, followed by a rough sketch of how they might combine into a go / no-go call:

  • What level of quality is required? Is the system life-critical, mission-critical, or only for casual personal use? This quality level relates directly to the level of confidence you need to have that the system is ready.
  • It is far easier to determine that the system is not ready. Finding a critical defect (aka a showstopper), or finding even a few serious defects in critical functionality is usually sufficient to warrant a no-go assessment. In contrast, achieving sufficient confidence that the system is good for release usually takes more work.
  • How much of the system's functionality have you tested? If there are significant features that are mostly or entirely untested, then you likely will not be prepared to recommend that the system is good to go. So you should frequently favor testing critical functionality broadly across the entire application rather than focusing in detail on a single component. This conflicts to some degree with the approach suggested for the prior goal of finding defects, where you might choose to focus your testing on an error-prone module.
  • False confidence is a very real danger. This is especially true for developers - I have seen or heard of many who have an unrealistically high level of confidence that their code works, in some cases without any testing whatsoever. As a specific example, I have heard developers state that if their code compiles, it is good enough to be promoted to acceptance testing. (And this was for business-critical applications, not casual use.) This is one reason to have one or more testers with a testing mindset on a development team - people who will verify for themselves that the system works rather than assume it. To combat this false confidence it may help to read my article Would you trust your life to your code?
  • Your testing should be designed as much as possible to find defects of the highest severity and highest relevance to users, as these are the kinds of defects most likely to warrant a no-go assessment. You might be able to find lots of cosmetic defects like spelling mistakes in the application's help documentation, but this does not really contribute much towards an assessment of readiness. You may have noticed that even when discussing the prior goal of finding defects, my assumption is that you will bias your efforts towards finding more severe and more relevant defects.
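
As a purely illustrative complement to the factors above, here is a small sketch of how the asymmetry between a 'no-go' and a 'go' might be expressed in code. The severity labels, the threshold of three serious defects, and the requirement that every critical feature has been exercised are hypothetical values chosen for the example, not a standard.

    def readiness_assessment(open_defect_severities, critical_features_tested,
                             critical_features_total):
        """Illustrative go / no-go sketch: a no-go is easy to justify,
        while a 'go' requires broad evidence across critical features.
        All thresholds and labels here are assumptions for the example."""
        if "critical" in open_defect_severities:
            return "no-go"                  # one showstopper is enough
        serious = sum(1 for s in open_defect_severities if s == "serious")
        if serious >= 3:
            return "no-go"                  # a few serious defects also block release
        if critical_features_tested < critical_features_total:
            return "insufficient evidence"  # untested critical features prevent a 'go'
        return "go"

    # Example usage with made-up data:
    print(readiness_assessment(["minor", "serious"], 8, 8))  # go
    print(readiness_assessment(["critical"], 5, 8))          # no-go

The point of the sketch is the shape of the logic, not the numbers: evidence for 'not ready' can come from a single observation, while confidence in 'ready' has to be built up across the whole system.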

My goal for this article was to motivate you to think more carefully about the question of when testing is done and enable you to be more effective as a tester. To test :) whether I achieved this goal, please leave a comment below letting me know what you thought of the article. Thanks!

If you find this article helpful, please make a donation.

3 Comments on “When is Testing Done?”

  1. Keyur Amin says:

    Good article.

  2. Ruslan Urban says:

    Basil, check out BDD – Behavior-Driven Development. That is another way to keep requirements and code in sync and avoid a slew of regression issues. The requirements, written in Gherkin notation, can be maintained by non-technical people such as BAs, but a change in the requirements will automatically trigger a change in test coverage and can help detect regression issues even before the introduced changes are tested manually.

  3. Thanks for the tip, Ruslan.
