Bad Testing Idea #357: Automatic Suite Partitioning

Here at Guidewire we’ve been attempting to do automated developer testing of one sort or another for probably about 7 1/2 of the 8 years I’ve been here, and in that time we’ve come up with a lot of bad ideas. As it turns out, writing tests that can then be run and maintained for years across multiple major versions of a product is really, really hard. There are a bunch of different ways to fail: writing tests so narrowly that they don’t accurately test the application (i.e. the tests still pass even though the application is broken), writing tests that are incredibly fragile and require huge amounts of maintenance, writing tests that are non-deterministic, writing tests that run differently in your automated harness versus on a developer’s machine, writing tests that run too slowly . . . the list is endless, and we’re in a fairly continuous cycle of adjusting how we write tests as we learn more about what works and what doesn’t.

Today, though, I’d like to call out one particular bad idea we’ve had, in the hopes that I can discourage anyone else from ever trying it: automatically partitioning large, long-running test suites so they can easily run in parallel.

Those unit-test evangelists among you might scoff at the premise of the idea: “Why would you ever have suites that take that long to run?” you might ask. Well, given that this is the real world and all . . . stuff happens. When you step out of the world of true “unit” tests and move into integration testing, things start to slow down a bit . . . and when you start actually testing your UI, there’s really no hope. A suite of 5000 UI-level tests simply isn’t going to run in any reasonable amount of time, no matter what sort of technology you’re talking about. (If there is some technology that can test a web client, or a desktop client for that matter, with an average test running time of < 0.1s, someone please correct me . . . but anyone who’s ever used Selenium or SWTBot will probably be lucky if their tests execute in 1s per test on average.)

As with all such ideas, this one started innocently enough. At first we didn’t have enough tests to need to split them up: we’d have a test suite for all the domain logic for an application, and a test suite for the UI, and they’d each take a few minutes to run. But as we added more tests, and more logic, the suites started to take longer and longer, so the logical thing was to split the tests up into suites so they could be run in parallel. So how did we go about doing that? Well, like the engineers that we are, we came up with an engineering solution: parallelization is something that should be done automatically by the framework, not something you should have to think about, right? So at first, we just created suites named things like CCServerTestSuite1, CCServerTestSuite2, through CCServerTestSuiteN, and each suite had some simple logic: find all the tests that could be in the suite, divide the number of test classes by the number of suites, segment the classes into N buckets, and then pick the bucket corresponding to the suite’s own number. So if there were 10 test classes and 5 suites, Suite1 would run the tests in classes 1 and 2, Suite2 would run the tests in classes 3 and 4, and so on.
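To make that concrete, here’s a minimal sketch of what that bucketing logic looked like in spirit (the class and method names here are hypothetical, not our actual harness code). The key property is that every suite independently computes the same sorted list and just grabs its own slice:

```java
import java.util.List;

// Hypothetical sketch of the static bucketing logic. Every suite sorts the
// candidate test classes by name (so each one computes the same ordering),
// then takes the slice corresponding to its own index.
public class PartitionedSuite {

    /** Returns the bucket of test classes for the 1-based suiteIndex. */
    static List<Class<?>> bucketFor(List<Class<?>> sortedClasses,
                                    int numSuites, int suiteIndex) {
        int total = sortedClasses.size();
        // Ceiling division so the last bucket absorbs any remainder
        int bucketSize = (total + numSuites - 1) / numSuites;
        int start = Math.min((suiteIndex - 1) * bucketSize, total);
        int end = Math.min(start + bucketSize, total);
        return sortedClasses.subList(start, end);
    }
}
```

With 10 classes and 5 suites, bucketSize works out to 2, so Suite1 gets classes 1 and 2, exactly as described above.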

Of course, that was annoying because we had to manually monitor the number of tests and the running times of the suites, and add suites as they got too slow. So we turned the crank once more and changed our test harness to automatically partition a suite into N pieces itself, where N was dynamically determined from the actual running time of the suites and adjusted up or down to try to stick near a target per-partition running time.
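Conceptually, the partition count just becomes a function of measured timings; something like this sketch (the timing inputs and the target are hypothetical):

```java
// Hypothetical sketch: pick the partition count from the suite's last
// measured total running time and a target per-partition running time.
static int partitionCount(long totalSuiteMillis, long targetPartitionMillis) {
    // Ceiling division: use enough partitions that each one stays near
    // (and ideally under) the target running time.
    long n = (totalSuiteMillis + targetPartitionMillis - 1) / targetPartitionMillis;
    return (int) Math.max(n, 1);
}
```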

With the benefit of hindsight, I can now confidently say that all of that was just a terrible, terrible idea. Now, in an ideal world, where tests within a test suite had no chance of interacting, this wouldn’t be nearly as bad. After all, that’s how you’re supposed to write tests, right? Well, sure . . . but as always, reality intervenes, and it turns out that for certain classes of tests (those darn integration tests again), it’s hard to ensure there are no interactions. If you’re testing the search functionality of the UI, you’d better be sure you know what’s in the database prior to the test executing, and that no prior test in the suite has mucked things up. You’d also better be sure you don’t have any static variables or other shared state that gets modified by any of your tests. Again, having tests not interact with each other is Testing 101 sort of stuff, but in practice it can often be difficult to 100% ensure it doesn’t happen (aside from running each test in isolation), and when you do inevitably screw it up you won’t notice until you have the right combination of tests running in the right order in your test suite. As a result, you can have latent test interactions that only show up as tests shift around and get re-ordered. To make matters even worse, our test suites have historically been fairly heterogeneous, meaning that the tests themselves require the system to be in different states prior to their execution, and the test framework is responsible for making the necessary system changes prior to the execution of each test . . . but that code is also not infallible. Consequently, certain issues will show up or not depending on which tests execute first in a suite and thus perform the initial system setup.
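As a contrived illustration of how easy this is to get wrong, here are two tests coupled through a static cache (made-up code, not from our suite); whether the second test passes depends entirely on whether the first one ran before it, and the same thing happens just as easily across test classes sharing a static:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Contrived example of a latent test interaction via shared static state.
public class ExchangeRateTest {

    // Shared mutable state: a static cache that survives across tests
    // running in the same JVM.
    static class RateCache {
        static Double cachedRate = null;

        static double getRate() {
            if (cachedRate == null) {
                cachedRate = 1.5; // pretend this was loaded from the database
            }
            return cachedRate;
        }
    }

    @Test
    public void testOverriddenRate() {
        RateCache.cachedRate = 2.0; // mutates shared state, never cleans up
        assertEquals(2.0, RateCache.getRate(), 0.001);
    }

    @Test
    public void testDefaultRate() {
        // Passes if it runs first; fails if testOverriddenRate ran earlier
        // in the same JVM and left the cache polluted.
        assertEquals(1.5, RateCache.getRate(), 0.001);
    }
}
```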

So what happens when you add in automatic suite partitioning? Suppose again that we’ve got 10 tests, Test1 through Test10, and initially we split the suite into two partitions. Partition 1 includes Test1 through Test5, while Partition 2 includes Test6 through Test10. Now suppose that you add two more tests: Partition 1 now includes Test1 through Test6, while Partition 2 contains Test7 through Test12. All of a sudden, Test6 runs after five other tests, rather than as the first test in a suite. If Test6 interacts with any of those first five tests, that issue will only show up after it moves over to Partition 1, and you’ll end up with a test break showing up in your continuous test harness that coincides with merely checking in additional tests. Given that people are adding, removing, and modifying tests all the time, that sort of shift of tests from one partition to another happens basically constantly in our test harness. In the absolute worst case, the new combination of tests results in some sort of memory leak, deadlock, or other problem that ends up killing the entire test suite partition rather than just producing a single failed test.
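If you want to see the shift for yourself, here’s a small standalone demo using the same bucketing arithmetic as the earlier sketch (again, hypothetical code, repeated here so it stands alone):

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates how adding tests shifts the partition boundaries.
public class PartitionShiftDemo {

    static List<String> bucketFor(List<String> sorted, int numSuites, int suiteIndex) {
        int bucketSize = (sorted.size() + numSuites - 1) / numSuites;
        int start = Math.min((suiteIndex - 1) * bucketSize, sorted.size());
        int end = Math.min(start + bucketSize, sorted.size());
        return sorted.subList(start, end);
    }

    static List<String> tests(int n) {
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            names.add(String.format("Test%02d", i)); // zero-pad so name order is numeric
        }
        return names;
    }

    public static void main(String[] args) {
        // With 10 tests: [Test01..Test05], [Test06..Test10] -- Test06 leads Partition 2
        System.out.println(bucketFor(tests(10), 2, 1));
        System.out.println(bucketFor(tests(10), 2, 2));
        // With 12 tests: [Test01..Test06], [Test07..Test12] -- Test06 is now last
        // in Partition 1, running after five tests it never ran with before
        System.out.println(bucketFor(tests(12), 2, 1));
        System.out.println(bucketFor(tests(12), 2, 2));
    }
}
```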

There’s one other, less catastrophic problem with automatic test partitioning, which is that the partitions end up being lumpy in terms of their running time, rather than consistently even. Tests, especially integration and UI tests, can vary widely in how long they take, so merely ordering tests by name and then chopping them up into evenly-sized partitions (in terms of number of classes) doesn’t ensure that the partitions will be even in terms of running time. It’s fairly common for our partitions, for example, to vary in running time between 5 and 35 minutes, simply because the automatic partition splits end up lumping together a bunch of slow-running tests. The testing turnaround time for a code branch is, naturally, bottlenecked by the slowest-running partition, so lumpy suite execution times just mean that we spend more time waiting for tests to finish running prior to pushing or pulling a branch; not catastrophic, but certainly not ideal either.

So what’s the solution? Well, the first obvious solution is to make the tests run fast enough that you don’t need to partition them. That’s a whole lot easier to do if it’s an explicit design goal of your testing efforts from the start, and a whole lot harder if you’ve ignored test speed problems over the years because you assumed you could just run the tests in parallel anyway. Barring that, I much prefer to explicitly group tests together in suites based on whatever sort of categorization makes sense (ideally functional area), explicitly controlling which tests run with which other tests. That way the interactions between tests within a suite are at least stable (i.e. Test6 always runs before Test7, and never runs after Test5, because Test5 is always in a different suite), and the suites can be chopped up in a way that gives them a more consistent running time.
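In JUnit 4 terms, that explicit grouping is just a hand-maintained suite class per functional area; the test classes listed below are made-up placeholders:

```java
import org.junit.runner.RunWith;
import org.junit.runners.Suite;

// An explicitly maintained suite for one functional area. Membership and
// ordering change only when a developer deliberately edits this list, so
// any interactions between these tests are at least stable from run to run.
@RunWith(Suite.class)
@Suite.SuiteClasses({
    ClaimSearchTest.class,    // hypothetical test classes, grouped by
    ClaimCreationTest.class,  // functional area rather than split up by
    ClaimAssignmentTest.class // an automatic partitioner
})
public class ClaimFunctionalAreaSuite {
}
```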

There’s one other possible approach worth mentioning, which is to completely isolate the tests somehow. If each test runs entirely on its own against a freshly-started server, a fresh VM, a fresh database, a fresh browser session, etc., there’s no chance of interaction between tests, and you can parallelize at the level of the individual test class. Unfortunately, doing that for integration tests that require a significant amount of one-time setup (e.g. starting up a server, initializing the database, etc.) in a way that’s performant in a world without infinite processing resources is . . . difficult. One of our developers has done some experiments internally around trying to make that happen using virtual machines, but we haven’t yet managed to develop the technique to the point where we can realistically do it for our entire test harness.
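For illustration only, here’s roughly what full per-test isolation looks like in JUnit 4 terms, with a stand-in TestServer helper (entirely hypothetical; the real server and database setup is exactly what makes this so expensive):

```java
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

// Sketch of full per-test isolation: a fresh server and a fresh database
// for every single test. Correct by construction -- no test can ever see
// another test's state -- but the one-time setup cost is paid every time.
public class FullyIsolatedTest {

    // Stand-in for whatever actually boots the app server against a freshly
    // initialized database; in reality this takes minutes, not milliseconds.
    static class TestServer {
        static TestServer startWithFreshDatabase() { return new TestServer(); }
        void shutdown() { }
    }

    private TestServer server;

    @Before
    public void setUp() {
        server = TestServer.startWithFreshDatabase();
    }

    @After
    public void tearDown() {
        server.shutdown();
    }

    @Test
    public void searchStartsFromKnownState() {
        // The database contains exactly what startWithFreshDatabase() loaded,
        // so no prior test can have mucked things up.
    }
}
```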

So for now, at least, we’re moving back to explicitly organizing suites and away from automatic test suite partitioning. It was a noble experiment, but one that I ultimately consider to have failed.