The Cardinal Rule of Automated Testing
Posted: April 22, 2008
I’ve had numerous conversations over the past year or so with people who aren’t that familiar with writing automated tests (be they unit tests, functional tests, or what have you), and all the advice I give can really be summed up in one simple rule:
Your tests should break if and only if the product is broken
It seems pretty obvious on the face of it, which it is, but it also has some non-obvious implications. I’ll start by breaking the statement out into its two halves, and then drill down on each to discuss its implications.
Your tests should break if the product is broken
On the surface of it, this is just a simple statement about test coverage: you want to make sure that as much of your application as possible is covered by automated tests. Full coverage of all combinations of behavior is, naturally, not possible, and there’s a tradeoff between the effort spent implementing and maintaining the tests and the number of bugs they prevent, so you naturally have to draw the line somewhere. It’s important to know where you’re drawing the line, though, so this rule also implies that you need to have some sort of systematic description of what your application does that you can match up with your tests so you know what’s covered and what’s not.
The less obvious implication of this rule is that your tests should mimic, as closely as possible, the actual production environment and production paths through the code. You can have what you think are the best unit tests in the world, but if all the units themselves are only tested in mocked-out, test-isolated bliss, you might find that your tests don’t catch any actual product breaks. There are more tradeoffs to be made here as well, this time between test development ease, execution speed, and maintenance burden on the one hand and the realism of the tests on the other. But as a general rule, the closer you can get your automated tests to exercising your full stack in the same way it will be exercised in production, the more bugs your tests will catch.
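To make that tradeoff concrete, here’s a rough sketch in Java with JUnit; the InvoiceCalculator, TaxService, and StandardTaxService classes are made up for illustration rather than taken from any real codebase. It contrasts a test that runs against a hand-rolled stub with one that wires up the same implementation production uses:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class InvoiceCalculatorTest {

      // Mocked-out version: fast and isolated, but it only proves the
      // calculator works against a hand-written stub, not against the
      // real tax service used in production.
      @Test
      public void testTotalWithStubbedTaxService() {
        TaxService stub = new TaxService() {
          public double taxFor(double amount) { return 0.0; } // stubbed behavior
        };
        InvoiceCalculator calc = new InvoiceCalculator(stub);
        assertEquals(100.0, calc.total(100.0), 0.001);
      }

      // Production-path version: wires up the same collaborator production
      // uses, so a break in the real StandardTaxService breaks this test too.
      @Test
      public void testTotalWithRealTaxService() {
        InvoiceCalculator calc = new InvoiceCalculator(new StandardTaxService());
        assertEquals(108.0, calc.total(100.0), 0.001);
      }

      // --- Hypothetical collaborators, included only so the sketch compiles ---
      interface TaxService { double taxFor(double amount); }
      static class StandardTaxService implements TaxService {
        public double taxFor(double amount) { return amount * 0.08; }
      }
      static class InvoiceCalculator {
        private final TaxService taxes;
        InvoiceCalculator(TaxService taxes) { this.taxes = taxes; }
        double total(double amount) { return amount + taxes.taxFor(amount); }
      }
    }

The stubbed test can stay green even if StandardTaxService is completely broken; the second test breaks when the production path breaks, which is exactly what the rule asks for.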
Your tests should only break if the product is broken
This is, by far, the harder rule to abide by. Test coverage is largely just a simple cost-benefit tradeoff: more effort == more test coverage. Reducing false positives in tests, however, is a much harder business. We’ve been writing and debugging and maintaining huge stables of tests for years now, and I’d guess that 80% of our test breaks are still false positives in the sense that they indicate that the test needs to be fixed rather than that the underlying code needs fixing. All that test fixing takes an enormous amount of time. We pay the price because it’s worth it, but we’re always trying to find ways to reduce it.
One way to reduce it is via the aforementioned strategy of making the tests mimic the production environment (as we’ve done with our TestBase infrastructure). That also prevents false positives by ensuring that tests don’t break simply because the test code path has diverged from the production code path, for example by having a mock no longer properly implement the semantics of the thing being mocked out, or having a test set up data that’s not valid.
A related technique is testing at the highest level of abstraction possible. This technique tends to prevent false positives that arise from implementation changes that don’t actually affect behavior; basically, if you unit test at too fine a level, you end up pinning yourself to a certain implementation, and changing that implementation can cause a massive test maintenance headache even if the change is correct and behavior-neutral. For example, I wrote the XSD typeloader for the Bedrock series of releases, which involved a ton of helper classes for handling all the different XSD features. Those classes all have some unit tests, but I put much more effort into testing things from the GScript level on down, which is really the level at which the semantics of the type system are meaningful; everything else is just an implementation detail. As I changed the implementation around (which I did often), those tests would still be valid since they described invariants about how the system should function regardless of the implementation, whereas I’d simply throw out the unit-level tests that became invalid. Since I had the higher-level tests, I wasn’t as worried that I’d lose coverage by throwing out those unit-level tests.
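To illustrate the idea without dragging in the actual type system, here’s a toy Java/JUnit sketch; ShoppingCart is a made-up stand-in, not real code, for the kind of behavior-level testing I’m describing. The test pins down an invariant of the public API rather than the internals:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;
    import java.util.Arrays;
    import java.util.List;

    public class ShoppingCartTest {

      // Behavior-level test: it asserts an invariant of the public API,
      // so it survives a rewrite of how the cart computes its total.
      @Test
      public void totalIsSumOfItemPrices() {
        ShoppingCart cart = new ShoppingCart(Arrays.asList(10.0, 2.5, 7.5));
        assertEquals(20.0, cart.total(), 0.001);
      }

      // Hypothetical class under test, included so the sketch is self-contained.
      // Whether total() loops, caches, or delegates is an implementation detail
      // the test above deliberately does not pin down.
      static class ShoppingCart {
        private final List<Double> prices;
        ShoppingCart(List<Double> prices) { this.prices = prices; }
        double total() {
          double sum = 0.0;
          for (double p : prices) { sum += p; }
          return sum;
        }
      }
    }

Rewriting total() to compute the sum some other way leaves the test untouched, which is exactly the property that makes higher-level tests cheap to keep alive.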
One of the more difficult problems in avoiding false positives is around web UI testing: you generally either have to pin your tests to the text on the page, which is naturally pretty fickle, or you have to pin your tests to generated HTML ids, which are also often fairly unstable. Our old, pre-Bedrock UI testing framework had this problem: every time someone renamed a link or changed a button, a large number of tests would break with incomprehensible error messages that were difficult to track down (e.g. is the button “foo” not on the page because it’s been renamed, because it’s been removed, or because the page is just broken?). We’ve solved a lot of those problems with our new framework (not yet fully baked for customer use at this point, I believe) that exposes a strongly-typed object model for writing tests against the pages, which means that renaming or removing a button will cause related tests to break at compilation time, making it infinitely easier to fix them proactively.
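As a rough sketch of what that buys you (the AccountPage and ConfirmationPage classes below are hypothetical stand-ins, not our actual framework’s API), the test drives the page through typed accessors instead of string ids:

    import static org.junit.Assert.assertNotNull;
    import org.junit.Test;

    public class AccountPageTest {

      @Test
      public void submitButtonNavigatesToConfirmationPage() {
        AccountPage page = new AccountPage();
        // If clickSubmit() were renamed or removed, this line would stop
        // compiling, pointing directly at every test that needs updating
        // instead of failing at runtime with "element not found".
        ConfirmationPage confirmation = page.clickSubmit();
        assertNotNull(confirmation);
      }

      // Hypothetical page objects wrapping the generated HTML ids, so the
      // fragile id lives in exactly one place.
      static class AccountPage {
        ConfirmationPage clickSubmit() {
          // ...drive the browser against the stable id behind this accessor...
          return new ConfirmationPage();
        }
      }
      static class ConfirmationPage { }
    }
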
Lastly, good coding practices will generally help you out with reducing false positives, or at least making them easier to identify and fix. Treat your test code like it’s real, production code and pay attention to decomposition, code reuse, variable and method naming, etc. It might not strictly result in fewer false positives, but it will make them easier to fix if the fix only needs to be made in one well-named method instead of in dozens of copy-and-pasted methods littered all over the code base.
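As a small, made-up example (the Policy domain and the createStandardPolicy helper are invented for illustration), here’s what that decomposition looks like in practice: two tests share one well-named setup method, so when the way a policy is constructed changes, only one place has to change:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class PolicyRenewalTest {

      @Test
      public void renewalExtendsTermByOneYear() {
        Policy policy = createStandardPolicy();
        policy.renew();
        assertEquals(2, policy.getTermInYears());
      }

      @Test
      public void renewalKeepsPremiumUnchanged() {
        Policy policy = createStandardPolicy();
        policy.renew();
        assertEquals(500.0, policy.getPremium(), 0.001);
      }

      // The single, well-named place where test data gets constructed.
      private Policy createStandardPolicy() {
        return new Policy(1, 500.0);
      }

      // Hypothetical class under test, included so the sketch is self-contained.
      static class Policy {
        private int termInYears;
        private final double premium;
        Policy(int termInYears, double premium) {
          this.termInYears = termInYears;
          this.premium = premium;
        }
        void renew() { termInYears += 1; }
        int getTermInYears() { return termInYears; }
        double getPremium() { return premium; }
      }
    }
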
These might all seem like obvious rules, but sometimes we overlook the most obvious things, so it never hurts to go back to first principles and ask basic questions like “Will this test break if the product is broken?” or “How do I make this test robust in the face of potential implementation changes?”