The Magic of Zero

Zero is kind of a magic number as far as I’m concerned; it’s the one number that’s never open to debate and never involves a slippery slope.

One of the best things we’ve done with our development process was the introduction, a couple of years ago, of a root “stable” branch that (in theory) is always kept at zero test breaks.  Prior to that, all development across all teams happened essentially in the root branch, meaning that destabilizing checkins from the platform team were competing with the PC or CC team’s attempts to get a preview release out to a customer.  The resulting flood of checkins meant that there were always a lot of test breaks at any given point in time, and it wasn’t always clear whose responsibility they were (even though our harness attempts to assign them to individuals).  Trying to get all the breaks fixed generally involved temporarily halting code checkins for all developers, and the constantly high level of breaks made the incentive to fix any given break much lower.

We eventually switched to smaller child “active” branches living off of a root “stable” branch for the release, with the branches managed by the individual teams and by smaller groups within those teams as appropriate.  Those branches are synced down from stable regularly and pushed back to the stable branch only when they’re at zero test breaks, thus maintaining the zero test break rule.  That change has had two positive effects on test breaks and fixes.  First, since the stable branch is kept (in theory) at zero test breaks, any breaks within a branch are unambiguously the responsibility of that group, which greatly reduces the diffusion of responsibility that occurs when 50 developers share a branch.  Second, the number of breaks in any one branch is often kept lower, which prevents major changes from masking smaller breaks and makes any given test break more visible.  That encourages people to fix their tests in a more timely manner and generally keeps zero test breaks in sight at all times.
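As a rough illustration of that push rule, here is a minimal sketch in Python.  It is not our actual tooling: the Branch class, its fields, and the branch names are all made up for the example, and treating "synced with stable" as a hard precondition for pushing is my assumption about how such a gate would be written.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """Hypothetical summary of an active branch (not a real harness type)."""
    name: str
    synced_with_stable: bool  # assumption: latest stable has been merged down
    test_breaks: int          # number of breaks reported for this branch


def can_push_to_stable(branch: Branch) -> bool:
    """Allow a push only from a synced branch with exactly zero test breaks."""
    return branch.synced_with_stable and branch.test_breaks == 0


# Hypothetical branches: one clean, one with a single remaining break.
platform = Branch("platform-active", synced_with_stable=True, test_breaks=0)
pc = Branch("pc-active", synced_with_stable=True, test_breaks=1)
```

The point of the sketch is that the check compares against zero and nothing else, so there is no threshold to negotiate.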

The benefit of having zero test breaks in the stable branch pays off in the stability of the code; the application teams get to work with more stable versions of the platform code instead of mid-change versions since the platform code has been pulled from the stable branch, the platform team can verify their changes against working application code instead of half-working code, and anyone can take a build for any product off of the stable branch and assume that it’ll basically work as advertised (though it might be a week or two out of date).

Of course, there are still some kinks in the system.  Sometimes you can’t fix all the tests before you push your code; maybe you’ve changed things so fundamentally that half the tests will need to be rewritten, but that will take months.  Or perhaps someone has just written a test for something that was already broken but previously untested, and you don’t have time to fix it right away.  In those cases, we’ll often comment out tests or, via annotations that our test harness understands, mark the test as “disabled” (i.e. don’t run it) or a “known break” (i.e. we know it’s broken, there’s a bug filed, and someone’s on it, so don’t really count it as broken).  On top of that, we still have non-deterministic tests that crop up sometimes, or environment-specific problems that won’t be caught or won’t even show up until code is pushed to the stable branch.  All of those problems are frustrating, and it’s painful to have to open up backdoors to the “zero broken test” rule, but by tweaking the definition of a “broken test” a bit we can at least still come up with a hard, unambiguous rule.
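To make the tweaked definition concrete, here is a small sketch in Python of how a “broken test” count might exclude disabled and known-break tests.  The Marker and CaseResult names and the sample test names are invented for illustration; our real harness uses its own annotations, which are not shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Marker(Enum):
    NONE = "none"
    DISABLED = "disabled"        # the test is not run at all
    KNOWN_BREAK = "known_break"  # run, but a failure is not counted as a break


@dataclass
class CaseResult:
    """Hypothetical result record for one test."""
    name: str
    marker: Marker = Marker.NONE
    passed: Optional[bool] = None  # None means the test was not run


def counted_breaks(results: List[CaseResult]) -> List[str]:
    """Return the names of failures that count against the zero-break rule."""
    return [
        r.name
        for r in results
        if r.marker is Marker.NONE and r.passed is False
    ]


results = [
    CaseResult("test_quote", passed=True),
    CaseResult("test_rating", passed=False),
    CaseResult("test_legacy_import", marker=Marker.DISABLED),
    CaseResult("test_new_edge_case", marker=Marker.KNOWN_BREAK, passed=False),
]
```

With these sample results, only test_rating counts as a break, so the rule is still “this list must be empty.”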

And that, of course, is because zero itself is so unambiguous.  If you say “we can’t have more than 10 broken tests,” then it’s a slippery slope:  what if five are really critical breaks?  What if there are 12 broken tests?  If I only have one broken test assigned to me, maybe it’s okay if I just leave it there and assume other people will fix theirs to get us down to the right number.  Zero has none of those problems; it’s unambiguous, it doesn’t let anyone off the hook, and it’s impossible to argue with.

In the next development cycle for PolicyCenter, I’m hoping to introduce a similarly unambiguous policy with regard to bugs that are filed as we develop, though we’ll see how that goes.  Ideally the only acceptable number of open bugs for the release will be zero at any given point in time; bugs will either be deferred to a future release, reclassified as “improvements,” marked as permanently “won’t fix,” in the process of being fixed, or waiting for the developer to fix some different bug.  Rules like “fix important bugs now” or “don’t leave too many bugs for the end” are simply too vague and open to interpretation; zero is pretty much the one number that no one gets to argue with.


Five Sprints into SCRUM

We kicked off Sprint 5 yesterday for the Application Framework team.

Guidewire development follows the SCRUM methodology. However, over the years, and for various reasons, the ideas behind Sprints have not been followed exactly. Some of those reasons are actually good ones. However, that does not mean it was the best decision, and some development teams, including the AF team, are trying to bring meaningful Sprints back to the development process.

So what have I done differently this time?

We ended up using JIRA to track our stories. There are many reasons for this. I think the first one is the kind of work we are doing right now. We are not yet doing active development, but rather fixing bugs for a point release and running performance tests. Since all the bugs are already created in JIRA, using JIRA to track items that are not bugs as well makes it easy to see everything we need to do in any given Sprint. On the weekly work-from-home day, which each Guidewire employee can choose freely, it is very convenient to go to JIRA to pick the next piece of work.

I am still keeping a Sprint board by writing down the JIRAs on the story cards, but it is not as effective as I would like it to be. I think one reason is that QAs are verifying the JIRAs on their own schedule. (And the reason for that is that some QAs are not part of the AF team, because AF work affects other application teams.) I know it sounds strange, but that is the situation right now. We are talking about how to get away from this mode and have truly independent development teams, but until that happens, we will just have to pull through.

The Sprint board is now mostly for the daily Sprint meeting, where we talk about what we achieved yesterday and what we are planning to do today. I use it to help the team focus and work only on blocker JIRAs or the JIRAs scheduled for the Sprint. Old habits die hard, but we are making progress in that direction. When we schedule too many JIRAs for the Sprint, which has been the case for all the past Sprints, I use the Sprint board to figure out what to push to the next Sprint. I have not been doing this aggressively. Now that I have an idea of our current velocity, I’ll do it more.
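The idea of using velocity to decide what gets pushed can be sketched roughly as follows. This is only an illustration in Python: the function name, the JIRA keys, the estimates, and the velocity value are all made up, and the rule that blockers are always scheduled even when they exceed the velocity is my assumption rather than a team policy.

```python
def plan_sprint(items, velocity):
    """Split (key, estimate, is_blocker) tuples into scheduled and deferred keys.

    Blockers are always scheduled; other items are taken in order as long as
    the running total of estimates stays within the velocity.
    """
    # Put blockers first; sorted() is stable, so the rest keep their order.
    ordered = sorted(items, key=lambda item: not item[2])
    scheduled, deferred, used = [], [], 0
    for key, estimate, is_blocker in ordered:
        if is_blocker or used + estimate <= velocity:
            scheduled.append(key)
            used += estimate
        else:
            deferred.append(key)
    return scheduled, deferred


# Hypothetical backlog: (JIRA key, estimate in points, is it a blocker?)
backlog = [
    ("AF-1", 3, False),
    ("AF-2", 5, True),
    ("AF-3", 4, False),
    ("AF-4", 2, False),
]
scheduled, deferred = plan_sprint(backlog, velocity=10)
# scheduled is ["AF-2", "AF-1", "AF-4"]; deferred is ["AF-3"]
```

A smaller item further down the list can still fit after a larger one is deferred, which is roughly what happens when I shuffle cards on the board by hand.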

I am also changing the Sprint planning format. I no longer go through the JIRAs one by one and ask questions about them, because the feedback has been that it takes a long time and becomes uninteresting. I think the first reason is that we are not sharing enough to make it a team conversation. Rather, it is me and whoever owns that part of the system talking with each other and figuring out the tasks to do. Even then, because I cannot pair on each and every JIRA, I am not able to check that each JIRA is estimated correctly and completed within a Sprint. Without following up and closing the feedback loop, all the work of creating tasks and tracking them becomes rather pointless.

So in Sprint planning, I now show the list of scheduled JIRAs, talk about them briefly in groups by functional area, and track down the estimates after the meeting. I think I will move the estimation to before the meeting next time, so that I know how much to schedule for the Sprint.