The Cardinal Rule of Automated Testing

I’ve had numerous conversations over the past year or so with people who aren’t that familiar with writing automated tests (be they unit tests, functional tests, or what have you), and all the advice I give can really be summed up in one simple rule:

Your tests should break if and only if the product is broken

It seems pretty obvious on the face of it, and it is, but it also has some non-obvious implications.  I’ll start by breaking the statement into its two halves, and then drill down on each to discuss its implications.

Your tests should break if the product is broken

On the surface of it, this is just a simple statement about test coverage:  you want to make sure that as much of your application as possible is covered by automated tests.  Full coverage of all combinations of behavior is, naturally, not possible, and there’s a tradeoff between the effort spent implementing and maintaining the tests and the number of bugs they prevent, so you naturally have to draw the line somewhere.  It’s important to know where you’re drawing the line, though, so this rule also implies that you need to have some sort of systematic description of what your application does that you can match up with your tests so you know what’s covered and what’s not.

The less obvious implication of this rule is that your tests should mimic, as closely as possible, the actual production environment and production paths through the code.  You can have what you think are the best unit tests in the world, but if all the units themselves are only tested in mocked-out, test-isolated bliss you might find that your tests don’t catch any actual product breaks.  There are more tradeoffs to be made here as well, this time between test development ease, execution speed, and maintenance burden and the realism of the tests.  But as a general rule, the closer you can get your automated tests to exercising your full stack in the same way it will be exercised in production, the more bugs your tests will catch.
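
For a toy illustration of the difference (the CsvCodec class here is invented for illustration, not our code), consider a little codec: you could test the encode half in isolation against a canned expected string, or you could round-trip through both real halves, which is much closer to the production path and catches a break in either side:

```java
// Hypothetical sketch: a tiny key=value codec. A round-trip test through
// both real halves (decode(encode(row)) equals row) exercises the same
// path production does, instead of testing encode against a mock decoder.
import java.util.LinkedHashMap;
import java.util.Map;

class CsvCodec {
    // Encode a row as "key=value" fields joined by commas.
    static String encode(Map<String, String> row) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Decode the same format back into a map.
    static Map<String, String> decode(String line) {
        Map<String, String> row = new LinkedHashMap<>();
        if (line.isEmpty()) return row;
        for (String field : line.split(",")) {
            String[] kv = field.split("=", 2);
            row.put(kv[0], kv[1]);
        }
        return row;
    }
}
```

If someone changes the wire format in encode but not decode (or vice versa), the round-trip test breaks; a mocked-out test of either half alone would happily keep passing.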

Your tests should only break if the product is broken

This is, by far, the harder rule to abide by.  Test coverage is largely just a simple cost-benefit tradeoff:  more effort == more test coverage.  Reducing false positives in tests, however, is a much harder business.  We’ve been writing and debugging and maintaining huge stables of tests for years now, and I’d guess that 80% of our test breaks are still false positives in the sense that they indicate that the test needs to be fixed rather than that the underlying code needs fixing.  All that test fixing takes an enormous amount of time.  We pay the price because it’s worth it, but we’re always trying to find ways to reduce it.

One way to reduce it is via the aforementioned strategy of making the tests mimic the production environment (as we’ve done with our TestBase infrastructure).  That also prevents false positives by ensuring that tests don’t break simply because the test code path has diverged from the production code path, for example by having a mock no longer properly implement the semantics of the thing being mocked out, or having a test set up data that’s not valid.

A related technique is testing at the highest level of abstraction possible.  This technique tends to prevent false positives that come because of implementation changes that don’t actually affect behavior; basically, if you unit test at too fine a level you end up pinning yourself to a certain implementation, and changing that implementation can cause a massive test maintenance headache even if the change is correct and behavior-neutral.  For example, I wrote the XSD typeloader for the Bedrock series of releases, which involved a ton of helper classes for handling all the different XSD features.  Those classes all have some unit tests, but I put much more effort into testing things from the GScript level on down, which is really the level at which the semantics of the type system are meaningful; everything else is just an implementation detail.  As I changed the implementation around (which I did often), those tests would still be valid since they described invariants about how the system should function regardless of the implementation, whereas I’d simply throw out the unit level tests that became invalid.  Since I had the higher-level tests, I wasn’t as worried that I’d lose coverage by throwing out those unit-level tests.
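
To make that concrete, here’s a minimal Java sketch (invented names; the real typeloader tests are mostly GScript): the tests pin the stack’s LIFO behavior, an invariant that survives swapping the backing ArrayDeque for a raw array or a linked list, whereas a test that inspected the internal storage would break on any such change.

```java
// Hypothetical sketch: test the behavior-level invariant ("pop returns
// the last value pushed"), not the implementation detail (what the
// backing data structure looks like).
import java.util.ArrayDeque;
import java.util.Deque;

class IntStack {
    // Implementation detail: could be replaced by a raw int[] without
    // breaking any test that only exercises push/pop/isEmpty.
    private final Deque<Integer> items = new ArrayDeque<>();

    void push(int value) { items.addFirst(value); }
    int pop() { return items.removeFirst(); }
    boolean isEmpty() { return items.isEmpty(); }
}
```

A behavior-level test asserts push(1); push(2); pop() == 2, and that assertion stays valid across every correct reimplementation.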

One of the more difficult problems in avoiding false positives is around web UI testing:  you generally either have to pin your tests to the text on the page, which is naturally pretty fickle, or you have to pin your tests to generated HTML ids, which are also often fairly unstable.  Our old, pre-Bedrock UI testing framework had this problem:  every time someone renamed a link or changed a button, some large number of tests would break with incomprehensible error messages that were difficult to track down (i.e. is the button “foo” not on the page because it’s been renamed, removed, or because the page is just broken?).  We’ve solved a lot of those problems with our new framework (not yet fully baked for customer use at this point, I believe) that exposes a strongly-typed object model for writing tests against the pages, which means that renaming or removing a button will cause related tests to break at compilation time, making it infinitely easier to fix them proactively.
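
A rough sketch of the idea in Java (all names here are invented; the real framework is GScript-based and drives an actual browser): each widget on a page gets a typed accessor, so the page’s HTML ids live in exactly one place, and removing or renaming a button turns into a compile error in every test that touches it rather than a cryptic runtime failure.

```java
// Hypothetical page-object sketch. FakeBrowser stands in for a real
// browser driver; it just records which element ids were clicked.
import java.util.HashMap;
import java.util.Map;

class FakeBrowser {
    final Map<String, Integer> clicks = new HashMap<>();
    void click(String id) { clicks.merge(id, 1, Integer::sum); }
}

class SubmitClaimPage {
    private final FakeBrowser browser;
    SubmitClaimPage(FakeBrowser browser) { this.browser = browser; }

    // One typed method per widget: the HTML id appears only here, so a
    // rename is a single-file change, and a removed button breaks every
    // calling test at compile time.
    void clickSaveButton() { browser.click("claim-save"); }
    void clickCancelButton() { browser.click("claim-cancel"); }
}
```

A test written as new SubmitClaimPage(browser).clickSaveButton() can’t silently go stale: if clickSaveButton disappears, the test stops compiling.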

Lastly, good coding practices will generally help you out with reducing false positives, or at least making them easier to identify and fix.  Treat your test code like it’s real, production code and pay attention to decomposition, code reuse, variable and method naming, etc.  It might not strictly result in fewer false positives, but it will make them easier to fix if the fix only needs to be made in one well-named method instead of in dozens of copy-and-pasted methods littered all over the code base.
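
For instance (a hypothetical sketch, not our actual fixtures): if every test builds its objects through one well-named helper, then when a constructor grows a new required field you fix one method instead of hundreds of copy-pasted setups.

```java
// Hypothetical sketch: a shared fixture builder for test code. If Policy
// gains a new required constructor argument, only TestFixtures changes.
class Policy {
    final String holder;
    final int coverageLimit;
    Policy(String holder, int coverageLimit) {
        this.holder = holder;
        this.coverageLimit = coverageLimit;
    }
}

class TestFixtures {
    // The single place that knows how to build a valid default Policy.
    static Policy defaultPolicy() {
        return new Policy("Test Holder", 100_000);
    }

    // Variations delegate to the same knowledge rather than copying it.
    static Policy policyWithLimit(int limit) {
        return new Policy("Test Holder", limit);
    }
}
```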

These might all seem like obvious rules, but sometimes we overlook the most obvious things, so it never hurts to go back to first principles and ask basic questions like “Will this test break if the product is broken?” or “How do I make this test robust in the face of potential implementation changes?”

6 Comments on “The Cardinal Rule of Automated Testing”

  1. Paul Loveridge says:

    Good points but you seem to be mixing your unit tests with your acceptance tests.

At my company we use unit tests for ensuring the code functions properly and that’s ALL it does. We then use an acceptance test framework (currently Fitnesse but we’re migrating to Jelly scripted tests) to test the product path.

Your unit tests should be quick and run in minutes whilst your acceptance tests can take as long as necessary.

  2. Alan Keefer says:

    We actually do have a split between unit tests and acceptance tests as well, but it’s not as hard a split as I think it usually is. We call our acceptance tests “smoke tests” and they test from the UI level on down, but they run on the same test infrastructure more or less, they’re all written in GScript (our unit tests are a mix of Java and GScript, depending on what language the unit under test is), and they all run in the same automated harness (the smoke tests take a bit longer than the normal unit tests, but not significantly so). Your mileage may vary, but that’s what we’ve found works best for us, partially because the smoke tests are so much more valuable as far as making sure the application works end-to-end.

    I think the general rule “your tests should break if and only if the product is broken” applies equally well to both cases. If you like, you can change “the product” to “the unit under test” to make it more clear. In the case of acceptance tests, the unit under test is kind of the whole product; in the case of unit tests it might be a single method. Either way, though, reducing false positives and false negatives is important and incredibly difficult.

    I haven’t ever seen much written on the subject of test maintenance, though, which makes me wonder how many people actually have massive unit test suites that they retain over several versions of the same product. I imagine they do, but I haven’t seen much written about it; I know the sorts of problems that we run into, but I’d be really interested to hear about what other people run into.

    We’ve got > 40,000 test methods at this point, and even if 1% of those unit tests are “bad” tests that break erroneously, that means you could make a 20-minute code change and then have to fix 400 broken tests. When half of those tests were written at least 3 years ago using a less-advanced test architecture, against an application that’s gone through three major releases, keeping that from happening just becomes really, really hard. I think it’s important to design your testing infrastructure and the tests themselves with that sort of longevity in mind. Until you run into those problems, though, I don’t think it’s always obvious what pitfalls to try to avoid.

3. Hey, I was just thinking the same thing. I’ve run into some similar problems at work (although not so massive as you describe). Our test infrastructure has evolved, so there are many different approaches to testing, even inside the same file.

    However, most of this was because some classes got very big and difficult to understand (the fact that I’m just a newbie doesn’t help at all), and of course their tests got huge.

Another thing is that we do TDD, so tests are “worshipped”, and therefore people are loath to change them (that’s unit tests – acceptance tests are pretty much straightforward and will catch everything). So the question arises, how do you test tests? How can they become as robust as possible, testing the correct things, not testing irrelevant stuff?

    My idea was to refactor the classes under test to use as little state as possible, and promote as many methods as possible to pure functions, as functions are inherently easier to unit test, you don’t have complex state interactions and you don’t have to mock out lots of stuff or do any big setup.
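
A minimal sketch of that refactoring in Java (invented names): once the calculation is a pure static function, a test needs no setup, no mocks, and no state, just arguments in and an asserted value out.

```java
// Hypothetical sketch: a calculation promoted out of a stateful class
// into a pure function. Output depends only on the arguments, so each
// test case is a single assertion with zero fixture setup.
class PriceMath {
    static long applyDiscount(long cents, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent out of range");
        }
        return cents - (cents * percent) / 100;
    }
}
```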

    I like the idea of putting implementation specific helpers into separate files, so they can safely be deleted if the implementation changes. I’ll try that next for my current user story.

  4. Alan Keefer says:

State is definitely the enemy of testability; the closer you can get to a pure functional approach, with clear inputs and outputs, the better. My experience is that that gets harder to do the higher up the stack you are: it’s pretty easy to write a Set implementation that’s easy to test, or a String manipulation library, but much harder to do with something like a UI-level widget that could interact with the DB, a business object, the incoming request, etc. But at least separating out the pure functional stuff from the bits that really are state dependent can help.

My experience is also that mocks are generally pretty evil unless they’re really, really tightly defined; trying to mock out any sort of moderately complex interface generally leads to the mock diverging from the real system behavior, which either results in false positives (your code works against the real impl but not the mock, so the test fails) or false negatives (your code works against the mock but not the real impl). It’s hard, though: mocks let you decouple things, and used properly they can help ensure you stay decoupled.
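
Here’s a contrived Java sketch of that drift (all names invented): the real store returns null for a missing key, a hand-written mock quietly returns an empty string, and code with a missing null check passes against the mock but blows up against the real implementation, which is exactly the false-negative case.

```java
// Hypothetical sketch of mock drift. RealStore follows the usual
// Map.get contract (null for an absent key); SloppyMockStore does not.
import java.util.HashMap;
import java.util.Map;

interface KeyValueStore {
    String get(String key);
}

class RealStore implements KeyValueStore {
    private final Map<String, String> data = new HashMap<>();
    void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); } // null if absent
}

class SloppyMockStore implements KeyValueStore {
    // Diverges from the real semantics: never returns null.
    public String get(String key) { return ""; }
}

class Greeter {
    // Bug: calls length() on the result, which NPEs when the key is
    // absent — but only against the real store, not the mock.
    static int greetingLength(KeyValueStore store, String user) {
        return store.get(user).length();
    }
}
```

The buggy Greeter “passes” every test written against SloppyMockStore and fails the first time it sees a missing key in production.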

    Our experience is that it’s just too hard to do that, and the price of having bad mocks is astronomical, so we hardly use them anymore and just worked on making our full stack as quick to start up as possible. That makes the tests take longer to run and removes the de-coupling benefits, but has dramatically reduced the amount of test maintenance we have to do. It’s a tradeoff I wish we didn’t have to make, but that’s where we’ve ended up.

We’ve also had times in the past where our unit tests worked but the system as a whole didn’t because we focused too much on the unit level and not enough on the functional/acceptance level. We even once had a rule on a new product that everything had to be 100% unit tested, which just led to people gaming the test coverage metrics and didn’t seem to have any positive impact on keeping the app working. The only thing I’ve seen work are higher-level tests. Acceptance tests are, in my opinion, a harder problem: harder to write, slower to run, much harder to refactor as the application changes, and the explosion of combinations means you can only ever test a small fraction of all the possible interactions. But they also do a much better job of catching bugs than unit tests.

One partial solution to that conundrum is to write subsystem-level “unit” tests and end-to-end feature “unit” tests that work at an intermediate level of abstraction; higher than unit tests since they test a whole host of classes working together, but lower-level than acceptance tests working against the whole application. So for example, I mentioned above the XSD typeloader that produces types in GScript based on an XSD. I wrote unit tests for many of the implementation classes, but I also wrote tests from the client perspective; if the implementation changed I just threw out those unit tests that were no longer relevant, but the end-to-end tests generally stayed working. I’m also not the strictest TDD adherent, so I wouldn’t always rewrite the unit-level tests in those cases; if the implementation is in serious flux, writing and re-writing the tests costs more than it’s worth to me. I do the same thing with new metadata features for our ORM layer; if I add a new kind of property, I start with the end-to-end behavior of the property (how it behaves in GScript, in Java, how it’s stored in the DB, how you query on it, etc.) and write those tests, then fill in more detailed unit tests as necessary for the individual classes involved in the implementation of that feature. Those end-to-end tests aren’t real “unit” tests, but they tend to be more likely to express the right sort of invariants about the application’s behavior and be robust in the face of future changes.

    We’ve also put a huge amount of effort into making it easier to write smoke tests so we can have better end-to-end coverage of our application, which is really a topic for a separate post.

  5. We have acceptance tests (we call them just FunctionalTests) for each user story and each defect (we write new ones as we fix/implement things).

    We try to stay as near the UI level as humanly possible: When the test runs, you can see the mouse moving around clicking stuff, text being typed in, failures simulated (we have a test that “unplugs” the network cable and checks that the program responds correctly). All from the users’ point of view, so unless the specification for something changes, these tests pretty much stay the same.

This can be very helpful, especially for new people, since you can pretty much hack around all you want, and you will really know if you’ve broken something. The drawback of course is that a full run takes around 4 hours to complete, but we can distribute it across different machines to shorten that.

    I like TDD, but people can get confused about it, especially if they forget the last step which is “refactor, make sure the tests still pass with no modification”. This is a bit similar to your cardinal rule.

[…] The Cardinal Rule of Automated Testing – Alan Keefer discusses an important premise of testing. His tests seem to be more like integration tests than unit tests, but a lot of the principles discussed apply to both […]
