The Magic of Zero

Zero is kind of a magic number as far as I’m concerned; it’s the one number that’s never open to debate and never involves a slippery slope.

One of the best things we’ve done with our development process was the introduction of a root “stable” branch a couple of years ago that (in theory) is always kept at zero test breaks.  Prior to that all development across all teams happened essentially in the root branch, meaning that destabilizing checkins from the platform team were competing with the PC or CC team’s attempts to get a preview release out to a customer, and the resulting flood of checkins meant that there were always a lot of test breaks at any given point in time and it wasn’t always clear whose responsibility they were (even though our harness attempts to assign them to individuals).  Trying to get all the breaks fixed generally involved completely halting code checkins temporarily for all developers, and the constantly high-level of breaks made the incentive to fix any given break much lower.

We eventually switched towards smaller child “active” branches living off of a root “stable” branch for the release, with the branches managed by the individual teams and by smaller groups within those teams as appropriate.  Those branches are synced down from stable regularly and pushed back to the stable branch only when they’re at zero test breaks, thus maintaining the zero test break rule.  That change has had two positive effectives on test breaks and fixes.  First of all, since the stable branch is kept (in theory) at zero test breaks, that means that any breaks within a branch are unambiguously the responsibility of that group, which greatly reduces the diffusion of responsibility that occurs when 50 developers share a branch.  Secondly, the number of breaks is often kept lower, preventing major changes from masking smaller breaks and increasing the visibility of any given test break, encouraging people to fix their tests in a more timely manner and generally leaving zero tests breaks in sight at all times.

The benefit of having zero test breaks in the stable branch pays off in the stability of the code; the application teams get to work with more stable versions of the platform code instead of mid-change versions since the platform code has been pulled from the stable branch, the platform team can verify their changes against working application code instead of half-working code, and anyone can take a build for any product off of the stable branch and assume that it’ll basically work as advertised (though it might be a week or two out of date).

Of course, there’s still some kinks in the system.  Sometimes you can’t fix all the tests before you push your code; maybe you’ve changed things so fundamentally that half the tests will need to be rewritten, but that will take months.  Or perhaps someone has just written a test for something that was already broken but untested previously, and you don’t have time to fix it right away.  In those cases, we’ll often comment out tests or, via annotations that our test harness understands, mark the test as “disabled’ (i.e. don’t run it) or a “known break” (i.e. we know it’s broken, there’s a bug filed, and someone’s on it, so don’t really count it as broken).  On top of that, we still have non-deterministic tests that crop up sometimes or environment-specific problems that won’t be caught or won’t even show up until code is pushed to the stable branch.  All of those problems are frustrating, and it’s painful to have to open up backdoors to the “zero broken test” rule, but by tweak the definition of a “broken test” a bit we can at least still come up with a hard, unambiguous rule.

And that, of course, is because zero itself is so unambigious.  If you say “we can’t have more than 10 broken tests,” then it’s a slippery slope:  what if five are really critical breaks?  What if there are 12 broken tests?  If I only have one broken test assigned to me, maybe it’s okay if I just leave it there and assume other people will fix theirs to get us down to the right number.  Zero has none of those problems; it’s unambigious, it doesn’t let anyone off the hook, and it’s impossible to argue with.

In the next development cycle for PolicyCenter, I’m hoping to introduce a similarly-unambiguous policy with regard to bugs that are filed as we develop, though we’ll see how that goes.  Ideally the only acceptable number of open bugs for the release will be zero at any given point in time; bugs will either be deferred to a future release, reclassified as “improvements,” marked as permanently “won’t fix,” be in the process of being fixed, or waiting for the developer to fix some different bug.  Having rules like “fix important bugs now” or “don’t leave too many bugs for the end” are simply too vague and open to interpretation; zero is pretty much the one number that no one gets to argue with.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s