Test Everything That Could Possibly Break

An often repeated maxim in ExtremeProgramming is "Test everything that could possibly break."

You can't possibly mean that.

Testing everything that could possibly break means coverage analysis to make sure you've hit every line of code. Testing everything that could possibly break means trying all different combinations of inputs to make sure you don't miss one that will make the class misbehave. Testing everything that could possibly break seems like a nice thing to say but an impossible thing to do in practice.

The 80/20 rule applies. 80% of the benefit of testing is gained with 20% of the effort that would go into the mythical "complete test" (if there is such a thing). "Test everything that could possibly break" says, to me, try to write a complete test. If you do that, you'll be wasting a ton of time getting very little additional benefit.

Your input is requested, desired, and appreciated. -- WayneConrad

To TestEverythingThatCouldPossiblyBreak, of course, is to solve the GeneralHaltingProblem. The best you will ever do is test an approximation of everything. -- ScottJohnson

I prefer the phrasing "test anything you could have written incorrectly." If you tend to make mistakes writing simple accessor methods, then test them (of course, you only have accessor methods because some other code that you wrote test-first needs them, right?) -- KeithRay

This page is not saying to Test Everything. Instead, test those things that Could Possibly Break. Start by accepting that certain things work (the compiler and microprocessor work until proven otherwise). Test everything you layer on top of this axiomatic foundation. -- DaveEaton

As the number of function points increases in a system, the statespace of the system grows combinatorially. Even if the fastest vector of input takes only a microsecond to test, it will still take eons to test systems on the scale of the telephone network. Years on the scale of Microsoft Word. Most people have months if not weeks. Consequently, at a certain (common) level of complexity, it is infeasible to test everything. Stop with the mantras. You're sounding like a methodologist. I would, on the other hand, like to hear real solutions to these tough problems. -- SunirShah

I read "everything that could possibly break" as an indication of what not to test: it says it's OK to ignore stuff that can't possibly break (obvious examples would be accessor methods that do nothing but set/get the values of private attributes). It doesn't say "test every possible way in which anything could break". -- DanBarlow

Yes, Dan. Here's the beginning of Chapter 34, the last "Bonus Track" in Extreme Programming Installed:

Everything that could possibly break

Test everything that could possibly break. What does this mean? How is it possible?

In XP, programmers write unit tests for all their code. The rule is, Test everything that could possibly break.

Sometimes people get really angry at us when we talk about this rule. Don't you know it's impossible to prove that a program works by testing, they'll shout. Don't you know an infinite number of things could go wrong?

Yes, your authors took those courses and read those papers, too. Hold your horses, as people used to say in Jeffries' day, and hear us out.

First of all, like all the XP rules, this one is meant to make us think, and to keep us on the hook. In XP we turn all the dials up to 10, not up to Reasonable.

Test everything that could possibly break. As we test, this impossible banner waves over our head. As we abandon testing an object, we ask ourselves seriously whether we have tested everything about it that could possible break. Because we're realists, we stop before writing an infinite number of tests. But we try hard to test everything that could possibly break.

The chapter continues with discussion and an example. --RonJeffries

It's really great how well Ron's writing characterized my reaction. Perhaps the long form of the rule might be, "Test everything that could possibly break. Warning: This rule not to be applied by the anal retentive except with adult supervision." -- WayneConrad

That still leaves open my question re: the telephone network, or superlarge systems. How do you ensure that it works? Parts of me think that testing isn't going to cut it. Maybe there is no way to ensure that it works. -- SunirShah

You use static analysis (often informal, not automated) to break down the near infinite number of possible states. Good OO modular design helps by cutting down the interactions between parts. WhiteBoxTesting helps; in effect we construct a theory about the code which identifies the important cases. There is an interplay between static analysis (thinking about the source text) and dynamic analysis (running test code). Neither on their own is sufficient. -- DaveHarris

I think part of the answer lies in actual use testing. For most projects that don't have the funding of telephone networks, sometimes it's easier to let things break in the wild and then fix them. This is what many OpenSource projects do, but then again this encourages programmers to release really bad code. From experience with expensive customers, they expect defects. As long as you have an fast way to recover, they don't mind too much, provided it doesn't happen a lot. shrug -- SunirShah

If you have good UnitTests, and think that you really are testing everything that could possibly break, consider using a MutationTesting tool, to see how good your tests are.

Is it needed to write unit tests that don't break the first time they are run? See UnitTestsThatDontBreak.

What is the conclusion? Should tests be applied - or not (to everything that could possibly break) As a follow up - does this apply also to the Operating System under which the tests are performed?

Yes, unit tests should break first then pass. This verifies the unit test along with the code as the test goes from failing to passing.

I think we need to remember that we're talking about UnitTests here. These are written by programmers to ensure that they are writing what they think they are writing. System level testing is a validation of the program against its requirements, and is covered by AcceptanceTests in XP.

One of the benefits of a test-first approach is that it encourages you to construct the program in pieces that are easily isolated, thus breaking the combinatorial explosion. The more fragments you have, the easier it is to convince yourself that you have managed to TestEverythingThatCouldPossiblyBreak. You have to test groups of these fragments, to ensure that they work together as intended; but if the granularity of isolation is finer than the granularity of the tests then you can manage to test subsystems as units. -- DaveWhipp

''That means what? That groups of related units that work together are tested as units after they are tested singularly? Then the groups are lumped together in a larger group and unit tested? The natural extension of that would be to put all units and groups together and test them as a unit. When is a unit a unit and a system a system? What is a unit? What is a system? I am not familiar with the difference and would honestly like to know.''

Everything is a unit, and everything is a system. The difference is one of perception. A unit is the thing you are working on, now, at the level of abstraction at which you are working. A system is a collection of units that have been worked on at different times or by different people or at different levels of abstraction. Ideally, the collection is no more complex than any one of its components: if complexity increases with scale, then you have a problem.

I'm thinking as I write here. A definition that just popped into my mind is that the purpose of refactoring is to convert a system into a unit. That is, in a properly factored system, behaviour at any scale is fully expressed within a single clump.

The problem, then, is to ensure that the things that a class uses conform to the expectations of that class. Each of those things will have their own unit tests, but they only confirm that the thing does what it thinks it does, not what you think it does. No sane person would write their own set of tests for each thing that they use, so instead we write unit tests to check that the combination of 'its + your' behaviour meets your expectation. But you don't want to verify the full complexity of the whole system, so you stub out any classes that "it" depends on.

There is a tradeoff. If you draw the boundary to close then the tests are simple, but they don't confirm that it works in its real environment. If the boundary is too distant, then they number/complexity of tests explodes. The position of the boundary will depend on the complexity of the behaviour of your dependees. If you depend on a random number generator, then you want to exclude or control it. If you depend on an "add" function then it is simpler to include it than to exclude it.

It might be theoretically possible to write a test evaluation package that ran all your tests, figured out what lines of code the tests were traversing, and then tell you what lines/methods/classes are being untouched by the tests. Would such a tool be useful? It might be a lifesaver on legacy projects without UnitTests, perhaps less important on projects that use UnitTests from day one.

As someone I worked with once said, it would be good to feel that every line of code entered into the system had at some time been executed. Since each line is there presumably because it might get executed one day, testing that it works in at least one circumstance seems a minimum requirement. In other words, trying for 100% coverage (even if you have to construct tests and even have slight different implementations just to make that possible, is a GoodThing. -- PaulHudson

Many people say to not test obvious code like accessors. Others say don't test code that is indirectly tested by other tests. So this rule isn't universal even in spirit.

Moved from XpVsStandardDefinitionOfUnitTest

I have another question: How complete is your coverage in a typical XP UnitTesting suite? Most of my points become meaningless once you let you go of complete coverage. -- ss

I think coverage is an interesting metric in any approach, including XP. The plain human thing is to test what you think needs testing, whether you are XPing it or not. Depending on the quality of the tester and the need for perfection, this might not be enough. If there were a code editor that showed in red all the lines of source that weren't covered by the current tests, I'd sure use it. -- rj

I like tests because I am very lazy, somewhat pragmatic, and flagrantly ignorant of established practices. I have no idea whether what I'm doing is what XP calls a unit test (although I suspect I'm pretty close), but from my point of view it doesn't matter much. I do whatever testing makes my whole job easier, and ignore whatever testing makes my whole job harder. If 80% coverage (to pick a number out of the air) is pretty easy to do and has clear benefits, then that's the coverage I get. If 100% coverage is harder than I perceive the benefit to be, then I don't do it. In my mind, there's a curve of cost vs. benefit, and that curve has a sharp knee. I'm looking for that knee when I test.

To my mind, this is the key thing. Not what percentage you are getting for coverage, or do you have a test for every class, or do you test accessors or not, or is it what anybody else is calling a unit test or a functional test. The key thing is that you are paying attention to the benefit you get from your testing, and you push your testing up to the point that the hurt starts to outweigh the benefit and then back off a little. When you find a defect and there was a test you could have done that was easier than having that defect, perhaps next time you'll do that test. When you find that there's a class of defect you test for that turns up so rarely that you feel like you're wasting your time, back off on testing for that class of defect. Then pay attention to what happens. If you were wrong to stop testing for that kind of defect, you'll hurt more. If you were right, you'll hurt less. Hurt is a wonderful feedback mechanism.

Speak, then listen. Throw, then watch. Feedback. Feedback. Feedback.

There's a fair chance that I'm not following established XP practices at all (although it doesn't matter all that much to me). But I hope this does explain how I make testing worth my while. -- WayneConrad

I think then we might be on similar pages then. I like FunctionalTests, especially if I have to work with others. When I work alone, I can be a little more lax unless the project gets big. Also, I don't write tests for things I don't know how to write tests for, like GUIs and concurrent code. I don't test first, but that's because it's not my thing. I have other ways of staying focused, like driven iterative development and "obvious next step." -- ss

That's a great insight, that test-first programming is a way of staying focused. Thanks for that.

In practice, I'd doubt that every line has value worth testing. Consider languages like Java that force you to catch exceptions where you don't want to. Also, there's nothing in UnitTesting that guarantees that functionality is used somewhere other than the TestSuite. There are other testing methods, like code path analysis that are meant to deal with this. But UnitTests alone can't do it. -- SunirShah

AHEM. Thou shalt not catch exceptions you don't want to. See DontCatchExceptions for details.

Any line that might contain a critical fault would be worth testing, wouldn't it? I'm just saying it's a matter of perspective. Just as the Function Point estimators will disagree with the Lines Of Code estimators, there are at least these two ways to approach the testing of code. But neither makes it more or less UnitTesting, just as sitting on the hood or sitting behind the wheel don't change the fact that it's still yer car.
Moved from HowManyTests

Ok , so if you're doing ExtremeProgramming, how many tests do you have to write?

Let's say in Java, do you write one test per class?

What about in Scheme or C, one test per function?

Great question. If you are talking UnitTest, it's at least two per public function (the second test is what forces you to use something more than a hardcoded return). I believe that function is the wrong denominator, though. It seems to me the relevant (in XP, anyway) rates are UnitTest / UserStory or UnitTest / Task. I generally seem to end up with 5-10 per unit, but I'm not actually doing XP. -- DanilSuits

The problem with using the term UnitTest is that it is so completely and utterly bogus for the task at hand. (XpVsStandardDefinitionOfUnitTest) The question is "how many tests?" which can reasonably be altered to "how many constraints?" to avoid confusion. (Watch the hands wave.)

While Danil suggests two UnitTests are required per method, consider this style, using the CollectionAndLoopVsSelectionIdiom:

 bool TestMyClass::test_foo {
     struct {
         char const* sz;
         int expected;
     } aExemplars[] = {
         "quux",    42,
         "quuux",   24,
         "quuuux",  888,

for( int i = sizeof aExemplars / sizeof *aExemplars; i--; ) { MyClass fixture; if( aExemplars[i].expected != fixture.foo(aExemplars[i].sz) ) return false;

// After foo --> bar! if( !fixture.bar() ) return false; } return true; }

is this one, two, or three tests? And then it gets more interesting:

 bool TestMyClass::test_bar {
     // After foo, bar if and only if sz has more than one distinct character.
     struct {
         char const* sz;
         bool expected;
     } aExemplars[] = {
         "abra",     true,
         "cadabra",  true,
         "aaaa",     false,
         "",         false,

for( int i = sizeof aExemplars / sizeof *aExemplars; i--; ) { MyClass fixture; fixture.foo(aExemplars[i].sz); if( aExemplars[i].expected != fixture.bar() ) return false; } return true; }

Doesn't this also test foo()? And how many times now?

Better answer. The goal is to have every line of code justified and exercised by an automated test.

For the especially clever, this means that you don't have to write a unit test every time you add a line of code. Sometimes a higher level unit test can catch drift in much lower dependencies. Understanding when you can avoid writing unit tests is difficult, though. It's a trade off in efficiency vs. quality, and quality in and of itself is a trade off against efficiency and, well, efficiency. (Bugs cost time.) -- SunirShah

I'd say test_foo is obviously three tests; that the testcode is relatively dense matters not at all. Though it does make it difficult to automatically count how many tests there are.

I wonder, though, if we can make your Better Answer into an objective standard - if you can change (say by commenting out) a line of code without the tests breaking, you don't have enough tests. -- DanilSuits

So, how many, in the end? Are there any significant metrics? Complete code coverage can be too much on one hand, and on the other hand it can't be enough if tests fail to generate specific boundary conditions. I ask this because unit tests are particularly boring. The general impression in XP circles is that , for example in Java ,you have to write a UnitTest for every class that you write (of course that's how you design them). But I find this masochistic. Maybe it's good enough to test more extensively the most interesting part of the code. -- CostinCozianu [Whee, I can rudely interrupt you too!, Sure you can, but I can't ]

I haven't read the whole page but I have a few points to add. When we started experimenting with XP, we wanted to measure some really simple things. One issue that we had is that we would have a somewhat phased approach. We had the ability to estimate some parts of the project but there was always a black-hole test/integration phase at the end that resulted in multiple test/fix cycles. On one project that was fairly established, the test/integration phase took between two and four weeks. On this project's first XP iteration, the test/fix cycle was about 2 days. After that, it was one day typically. To me, that's a pretty good indicator (I didn't say metric) that XP helped this project to gain "control" over that black-hole phase. Managers seem to like this. Managers seem to be able to ignore metrics because they have somehow been trained to distrust the word "metrics" and equate it with statistics which has a bad reputation in their minds. However, simple indicators like "test/integration phase went from 3 weeks to 1 day" speak well to managers concerns. I hope to measure more of these. I put this under Costin's comments not only because I enjoy his input on these pages but because I wanted to touch on something he said. Before I tried real XP, I thought that testing was boring and that testing everything was masochistic. Now that I've been doing it for about 16 months, I can't imagine ever going back. It takes a few months of suspending disbelief and trying it really earnestly but the light does come on. XP-style testing is not boring. It is actually fun once you get over the learning curve. -- MikeCorum

The corollary to TestEverythingThatCouldPossiblyBreak is to write software to AvoidComplexity - thus accomplishing the twin goals of increasing reliability while reducing the number of failure states that need testing. I think most of the comments above touch on this idea, but no one comes out and says it. -- MalcolmCampbell

I have found assertions good way to test code. First thing is to find, what to assert (20%/80% vs 100%). Most obvious things are in coming arguments. But what causes need for testing arguments? Obviously passtrough arguments don't need observation. Also don't arguments with valid value range matching ordinal types range. But those arguments, which uses some ordinal type more wider that program logic needs, needs to be validated. If that kind of argument assertion fails on runtime, you can see from callstack, that is reason for that illegal caller or should that assertion to be promoted to actual runtime error handler for that discovered new situation (desing flaw, unseen case).

Second type of assertion is to quard class members to not be setted on illegal state (null etc.). On that kind of situations you don't have callstack to be spyed on final runtime exception time, but on guarding assertion time you can see similarly, who tryes to put cargabe on object to be stalled later...

That style might not be pure XP, but so efisient, that I have writen whole package of classes in week (detail designing via "write java docs first") and finally performed two dosen test runs and each one have showed one voluntery bug to easely corret.

It is very easy to add "assert null != arg;" for all meaning full situations, and my mutual cost/pay back analysis says that if one "mine" explodes for each writen 1000 assertions and saves half day debug session with steping code by debugger and watching memory state, half second type time of each assert is worth. For that we can say, that try to create thight mine field to inhibit execution flow to branch on unthinked areas. Also state pattern helps that kind of flow control controlling.


See also: MutationTesting, CompleteCoverageIsExpensive, QualityElbow, AllThePossibilities


View edit of October 27, 2005 or FindPage with title or text search