As I mentioned in my previous post, the PolicyCenter team is moving to a stricter agile process in the next release, with a new focus on short-term iterations, up-front feature agreement, and “done doneness.” As we do that, though, we’re going to run into a lot of the conditions that traditionally derail agile practices; they’re the sorts of things the agile literature tells you to avoid because they make everything more difficult, but we don’t really have any choice about them.
They’re also the sorts of things that, in my experience, tend to make people pretty skeptical about agile’s claims. Reading through the agile literature, you can pretty quickly create a mental picture of the idealized agile project: a small, greenfield, internal development project. Furthermore, there’s often an emphasis in agile on the fact that all the various practices are reinforcing, so you can’t really drop any of them. That often adds up to some serious skepticism from people who aren’t working under those sorts of idealized conditions.
From what I’ve seen and read, there also isn’t a ton of guidance out there for how to apply agile methods in less-than-ideal cases, other than suggestions to try to get closer to the ideal case. Perhaps it’s just that every situation is unique, so maybe there aren’t any generally-applicable rules. To that end, I figured I’d document the sorts of problems we’re running into, and what we’re trying to do to deal with them.
Releasing Software to Multiple Customers
We ship to multiple customers, not just to one, which means that we have to use our product managers as customer proxies rather than having actual on-site customers. What we’ve found over the last few years is that on PolicyCenter the features are so complicated and so contentious that we need to put much more work into up-front agreement about what the features are, rather than relying on that to happen within the iteration. Otherwise, the product manager might need several days to check with customers or confer with other PMs to come up with an answer, or they might feel rushed to make a decision that later needs to be revisited. That means working harder to get prototypes, mockups, and PRDs in front of customers earlier and to spend more time vetting possible implementation options with the development team prior to even attempting to schedule stories.
Long Release Cycles
An ideal agile project releases frequently: maybe even every week, and at most every few months. Our release cycles vary between about 9 and 15 months. There’s not much we can do about that, however; despite our best efforts to make the product upgradable, upgrading still isn’t trivial for our customers, and once they’re close to or in production they have no desire to upgrade frequently. Frequent releases would therefore just mean more versions we have to support; releasing every 3 months instead of roughly every year would mean a 4x increase in the number of versions we need to support, which would be absolute suicide. Because of that, we’re stuck releasing relatively infrequently. The clear downsides are the need to make long-term release plans, additional date pressure on those plans (because the next release is a year out), and longer feedback cycles. The best we can do is to try to get better real customer feedback as we go (which is difficult given how much customers build on top of our application, so we don’t get the feedback until they’ve done that work) and to deal with the other issues as well as we can.
Long-Term Release Commitments
The long-term release commitments that are forced on us by long release cycles are, unfortunately, pretty unavoidable. Customers that are buying the product need to make sure the investment makes sense for them, and that is often highly dependent on what they’re getting and when. There might be certain key features they need in order for the release to be useful, and they’ll need to be able to budget, staff, and plan their projects, which means they’ll need to know when they can start and when they’ll get a finished product. So we just don’t have the luxury of saying, “we’ll work on things in your priority order, so you’ll get that eventually” or “if it’s not in this release, just wait a month or two;” the next release is a year out, and we’re prioritizing across multiple customers anyway instead of just one.
The best we can do there is to combine good old SWAG estimates with as much risk management as we can stomach to try to come up with a plan for what we can commit to, giving ourselves a huge buffer in order to deal with the inevitable fact that some estimates will be off by 2-3x and that some sorts of issues will always come up mid-cycle. I’m hopeful that a consistent team velocity combined with a release worth of data about how our initial feature-level SWAGs correspond to actual story points will let us do a better job of planning the next release after this one; I’ll let everyone know how that works out a year from now.
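To make that arithmetic concrete, here’s a little sketch of the kind of calculation I mean; every number in it is invented for illustration, and the real inputs would come from our actual velocity and estimation data:

```java
// Hypothetical release-planning arithmetic; all of these numbers are made up.
public class ReleasePlanSketch {
    public static void main(String[] args) {
        double velocity = 40.0;         // observed story points per iteration
        int iterations = 20;            // iterations in the release cycle
        double capacity = velocity * iterations;

        double swagDays = 600.0;        // sum of the feature-level SWAG estimates
        double pointsPerSwagDay = 1.2;  // measured correspondence from the last release
        double estimatedPoints = swagDays * pointsPerSwagDay;

        double buffer = 0.5;            // huge buffer for the 2-3x misses and mid-cycle surprises
        double committable = capacity * (1.0 - buffer);

        System.out.printf("capacity=%.0f points, estimated demand=%.0f, committable=%.0f%n",
                capacity, estimatedPoints, committable);
    }
}
```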
Large, Complicated Codebase
PolicyCenter is a really large, really complicated product. So large and so complicated, in fact, that no one person can really understand in detail how everything is supposed to work or how it’s implemented. Agile development relies heavily on shared code ownership in order to do things like incremental design and project-wide refactorings, but those get increasingly difficult as the codebase grows. A codebase that size also means slower build and test runs, and it necessitates a dedicated QA group that can focus on the high-level feature interactions that developers will end up missing because they just can’t fit the whole picture in their heads. Going forward we’re going to try pair-programming more in order to do a better job of spreading knowledge around the team, but the reality is that we’re going to have to have informal code ownership by small subteams.
The other danger is that the product’s design becomes fragmented and the various parts cease to fit together because no one has the big-picture view. Avoiding that is explicitly my job, and I haven’t entirely figured out yet how I’ll manage to keep doing it as the product gets larger; if I come up with anything novel or interesting I’ll certainly write about it. To me architecture really needs to come out of one or two people’s heads, and those people need to understand the whole system, and that just doesn’t really scale up that well.
Managing a large, complicated product requires a lot of engineers, so our team is currently 15 engineers (and growing), and once you add in PM and QA and Docs the overall team size is in the mid-30s. That’s not exactly a small team, and most agile practices geared towards small teams don’t work at that size. To combat that, we’ve split the team up into four different cross-functional “pods,” as I described previously, in an effort to make agile practices work for a subset of the team. There will still need to be cross-pod communication, and people will probably move between pods every so often, but in general we’re optimizing for communication within the pod and using the pod as the primary unit of organization, work assignment, and velocity tracking. We’ve tried that sort of organization before with mixed results, but we’ve never made the teams cross-functional and given them this much independence; hopefully that will be the missing ingredient that makes them work well and helps the team scale.
Building A Toolkit
The last major issue is that a large portion of what we produce is a toolkit and API for clients to use in customizing the application. Having a large published interface limits the kinds of refactoring we can do, and forces us to do much more thinking to get the API right up front so that we don’t have to change it. To a lesser degree, that’s true of our database schema as well; it’s got to upgrade from release to release in a reasonable amount of time, so if we screw it up too badly we might never really be able to fix it.
That also means we can’t really commit to full incremental design or architecture; we’ve got to have some idea of where we’re going and know if it’s something we can live with long term, because we have a lot less flexibility to fix things we don’t like in future releases. That’s naturally a difficult thing to do, and it really just requires a lot of skill, good taste, and luck. It also means we’ll definitely make mistakes by attempting to anticipate the wrong things, but we don’t really have the option to not think about the long-term implications of what we’re doing.
When I started working at Guidewire, back in 2002, the company was all of 15 people, maybe 10,000 lines of code, and one unreleased product. No one really knew anything about Test Driven Development or unit testing in general, and we didn’t really have a continuous integration server, but we did attempt to generally follow the scrum model: daily sprint meetings; month-long sprints of development work that combined design and implementation (and testing, such as it was before we had any QA folks); and a backlog of work for the release, organized in priority order, that we’d pull from to plan the next sprint. For a long time, that’s pretty much how we continued to develop: at the start of each sprint people would estimate out what they thought was 20 days of work, and at the end of the sprint we’d see what had actually gotten done, discuss what went well and what we should change, and use that to inform the plan and process for the next sprint.
Eventually we set up an auto-build and started to try things like unit testing, and after several fumbling false starts we managed to make testing a core part of our process and culture. Now, as we’ve mentioned before, we have our own in-house test harness application that manages running 40,000+ automated tests across dozens of branches over a farm of servers.
But somewhere along the line, the scrum process kind of broke down for us, in my opinion. It happened at different points on different teams, and you can point to a lot of factors as culprits:
- Communication breakdowns as the team got larger
- Increased inaccuracy of estimates as the product(s) got much, much larger and more complicated
- Increased maintenance costs as we increased the number of customers and releases
- Increased maintenance costs in the form of test maintenance
- Poorer estimates, increased complexity, and increased product surface area led to internal date slippages, which put pressure on everyone to scramble to still meet external date commitments, which in turn led to process breakdowns and increased technical debt
There are probably other factors in there that I’m forgetting, but the upshot was probably a pretty classic software development story: the methodology that worked well with 10 developers, tens of thousands of lines of code, one or two customers, and hardly any maintenance releases didn’t work so well with 50 developers, half a million lines of code, more customers, more releases to maintain, and 4-year-old crufty tests that often did more harm than good.
So what do you do about that? Clearly, we needed to change our development methodology somehow, but for a long time we avoided looking seriously at anything like XP: after all, we were already doing lots of testing, refactoring, month-long timeboxed iterations, iteration retrospectives, continuous integration, and maintaining a backlog of work that we pulled from each iteration.
Unfortunately, the agile community generally isn’t too helpful about dealing with real-world situations like ours: large teams, large codebases, years of stacked development, legacy unit tests, multiple customers to please, hard release commitments to be met, no real on-site customers, etc. Most of the literature just kind of tells you to change those things: split the team up, simplify your codebase, make your tests faster and more independent, get customers on-site, avoid hard release commitments, etc. Reading the agile literature can be frustrating at times as a result, and it can be easy to read through it and say “that won’t work for us” because, well, as strictly written it won’t. (I’ll expand on those issues in some later posts).
The result, unfortunately, was that we didn’t end up tweaking our process all that much. We did our best to deal with the test and code maintenance issues, and we attempted to split each product team into smaller “pods” within the team to address some of the management and coordination issues with larger teams. As we did it at the time, however, I don’t think it was a particularly successful approach.
And then came the 3.0 release of PolicyCenter, where we rewrote a huge percentage of the application from the inside out (i.e. without changing the end-user behavior all that much) in an attempt to address some major architectural issues that had led to an explosion in complexity, making the product buggy and difficult to work on. That kind of cleanup, however, is inherently a huge unknown: you’re not changing functionality, so you can’t really measure progress in terms of end-user changes; the changes were violent enough that going halfway on any of them wasn’t even close to an option; and they were so drastic that most of our existing tests wouldn’t run or compile anymore, meaning we had to start over from scratch on a lot of our testing efforts. (Realizing that we didn’t know what to test, incidentally, led to the creation of the Riki.) We attempted to organize into sprints, but the reality was that we had no idea how long things would take, product managers weren’t able to provide much oversight or exert much control, and it was a whole lot of controlled chaos. The fact that we made it out basically on time (as of our revised timeline) with a stable, functional product is a testament to the quality of the team, but there’s no way we could continue to work the way we have for the past year.
Meanwhile, one of Guidewire’s other products, BillingCenter, was being run very differently from the other application teams: they were using a process much closer to stricter agile methodologies like XP, with two-week iterations, story cards, point-based estimation, and a focus on getting things “done done” before moving on to the next feature. That was working much better for them than our scrum process ever had for the other teams (except, perhaps, back when we were tiny and had hardly any code), so naturally the rest of the teams have moved to adopt that model. Our ClaimCenter team already has, and PolicyCenter will in a few weeks when we start on our next release.
Of course, it’s never that easy, and we’ve got the disadvantages of a larger team than BillingCenter, a more complicated product, a more configurable product, much more disparity between our customers, a much larger long-term desired feature set, and a lot of resulting date pressure (both internal and external) around particular features. Unsurprisingly, those are some of the problems that got the PolicyCenter project in trouble in the first place.
Even so, I’m confident a process change will keep us on the right track and will help to alleviate some of the issues that have killed us in the past. So what, specifically, are we doing differently?
- Cross-functional pods – Our original mistake with pods was to only really include development in them. We’ve attempted to re-arrange our seating several times to include PM and QA in with the developers, which has helped, but we’re now going to more formally create sub-teams that officially include PM, QA, development, and (if we can) docs. We’ll reduce cross-pod communication as much as we can, optimize for high-bandwidth communication within the pods, estimate and assign work at the pod level, and do our best to let the teams have latitude to self-organize and experiment with what works best for them.
- Focus on “done done” – We fell into the classic development trap of leaving too much bug-fixing until the end, creating uncertainty and stress and letting deep-seated architectural issues pile up until far too late in the cycle. In my view, the lack of doneness is largely driven by date pressure: with date-based estimation and long-range release plans, developers always want to hit their estimates, and they’ll (often unconsciously) cut corners and skimp on testing to do it. Making “doneness” an explicit, shared criterion ought to fix that, though it’ll slow down our perceived rate of progress (while increasing our actual rate of progress in the long run).
- More up-front agreement on features prior to development – PolicyCenter functionality is complicated, hard to get right, and contentious, and it requires much more up-front research, experimentation, and debate than normal features do. In the past, we’ve started working on features before those issues were worked out, and the aforementioned date pressure would make people feel they had to build something even though there wasn’t necessarily agreement on what to build. Doing that work in-process, as it were, was often pretty fatal: the product managers would be rushed and the developers would be frustrated or would just make assumptions. We’re focusing now on using stories and doing more up-front work to figure out what to build so that when it comes time to plan an iteration we only schedule work that’s already been fully agreed upon by all parties.
- Shorter iterations and stricter timeboxing – Four weeks just turns out to be too long to go without adjusting course, which meant that our timeboxing was never that strict: priorities would be shifted mid-sprint when new issues came up, because people just couldn’t wait. Our lack of up-front agreement pretty much guaranteed that unexpected issues would crop up as well, and it ensured that our estimates would be inaccurate and wishful. Moving to two-week iterations, with more up-front agreement and more reliable estimates, should make it possible to better avoid mid-iteration corrections.
- Tracking velocity rather than estimates – Estimating work in days seems natural, but it’s just a horrible, horrible mistake. Doing it meant that we never corrected when our estimates were skewed by maintenance burdens or just chronic optimism about how fast we could work, and combined with a lack of up-front agreement our estimates were usually fairly inaccurate. The real upshot was that the team couldn’t commit to its estimates, further exacerbating the timeboxing problems, and we never really had a great indication of how fast we were actually going since a lack of “done doneness” threw things off as well.
That, of course, is merely my hope for how things will work out. We’re starting development of the next release a couple of weeks from now, using that process, and I’m sure we’ll learn plenty and make plenty of tweaks as we go. I’ll do my best to report back on how it actually works out, what difficulties we find, and what we try to do to overcome them.
As a development organization we’re by no means perfect, though we’re constantly looking for ways to improve, and one of the areas in which we’ve historically had the most room for improvement is internal documentation of requirements and feature specifications. We’ve come up with what we hope will be a much better long-term solution, but before I describe what that solution is, I’d like to rewind and tell the story of how we ended up where we are right now.
Imagine that your company is writing a policy administration system (such a randomly chosen example, I know), and you’d like to know the answer to the question “How do policy renewals need to work?” How would you go about trying to answer it?
Historically, we’ve done our requirements documentation and feature specification in a fairly old-school manner: product management would write up a document describing the requirements for a new feature, development would write up a design topic on the wiki for anything that required more thought and discussion, and development would write up a fuller specification afterwards describing how things actually work. The end result is that the “requirements” for a given piece of functionality tend to be difficult to discern after the fact: they’re scattered across a bunch of early-stage PM documents that 1) are generally deltas against each other, 2) don’t always resemble what was actually built, 3) tend to have a lot of ambiguities (since they’re written as normal prose), and 4) don’t capture any of the little things discussed and agreed upon by dev, PM, and QA over the course of actually building a feature. The development specs (when they’re actually up to date) tend to describe how things actually are rather than what the underlying business requirements are; they’re also written as deltas, at a semi-arbitrary level of detail, and aren’t written at all for things like UI functionality, while the design topics serve more to explain why things were implemented as they were. So if you want to know how policy renewals work, your best bet is just to ask someone; the information is so scattered, out of date, and incomplete that it’s impossible to piece it back together.
We do a lot of test-driven development, so you might ask “But what about the tests?” The agile philosophy is that the tests can often serve as the documentation, and that’s kind of true of well-written, complete unit tests (I still don’t think that’s 100% true, but that’s a different argument). But the problem is that unit tests themselves are at too low a level to be useful for answering higher-level questions like “How does a policy renewal work?” Even questions like “What happens when I click the ‘Add Vehicle’ button?” are difficult to document via tests because they require an entirely different level of tests than “unit” tests. They require end-to-end tests, and those tests tend to be harder to write and harder to read; they’re also much more difficult to ensure completeness for, since you can’t measure test coverage using a tool or even match up the set of methods against your set of tests. In addition, for infrastructure work the tests tend to help describe the implementation, not the high level requirements.
The other problem with using tests is, unfortunately, that they tend to get deleted when they break too badly; at some point it’s inevitable that some refactoring or other major change will break enough unit tests that you just don’t have the energy or inclination to fix them all right away, so you rewrite and fix what you can and just comment out or delete the ones you can’t. The same goes for tests that are written against the actual implementation rather than at some higher level of abstraction: if you change the implementation, all those tests are simply irrelevant, so you have no choice but to kill them. That might not be ideal, but practically speaking it’s what actually happens in the real world where real people write real tests, and as a result, tests are a shaky thing to rely on as the sole source of documentation about business requirements.
“What about story cards?” you might ask. Well, one unfortunate fact is that the policy team I work on hasn’t used story cards in the past (we will be using them in the next release cycle). One of our other teams does drive everything off of story cards, but even then I think there are some problems. First of all, stories are inherently deltas, and over the course of a release or over many releases the same functionality is often continuously changed, making it difficult to piece together an answer to “How does policy renewal work?”: doing so requires assembling all the stories relating to renewals over the course of several releases in chronological order so that the appropriate deltas are applied in order. Ouch. Story cards are also inherently somewhat unorganized and can contain information relating to multiple different parts of the system, so just assembling that set of cards in the first place can be difficult. Story cards would still be light-years ahead of where we were a year ago, so perhaps if we’d had them we wouldn’t have built the tools that we did, but since we didn’t have those cards we had to find a different way to do things.
So that was our situation a year ago: information about how things were supposed to work was largely in people’s heads, and we had scattered, generally untargeted end-to-end test coverage that touched many parts of the system.
That’s around the time we started to rework some major portions of our application, and before we started we thought the main risk we ran was that we’d break things without realizing it. In order to mitigate that, we wanted to fill out all (or at least a good number) of the tests for a given area of the application before we changed things. But how would we know we had “all” the tests and weren’t missing something? Without any obvious “units” to test we’d have no chance, so we decided to make our own units. They weren’t really stories in the traditional sense: they were statements like “The ‘Add Vehicle’ button takes you to the ‘New Vehicle’ page” and “The ‘Clone Vehicle’ button clones all selected vehicles, cloning all of their fields except for the VIN number.” Some of them could have been stories in the story card sense, but plenty of them were too fine-grained for story cards. For lack of a better term, we decided to call them “requirements” instead. Our process then became that we’d first attempt to reverse engineer the requirements for a page before we rewrote it, generally by reading any existing documentation and then by playing around with the page to see what it actually did. After we wrote those down, we’d try to have them reviewed by the product managers for accuracy and completeness, and then we’d use the requirements to drive a set of tests around the page. Ensuring a sufficient level of testing became much easier, because we could target the tests to the requirements just as you’d target unit tests to a method. Once we were done we were pretty sure we’d catch most of the breaks we might introduce, and we’d go ahead with whatever refactoring/rearchitecting needed to happen.
The idea was the right one, I think, but we had questions over how exactly we’d manage the requirements docs. What format would they be in? How would we organize them so people could find them? Hardest of all, how would we ensure they stayed up to date? We really wanted to measure coverage of the requirements as well, so how would we do that? To get the ball rolling we started out just using Google spreadsheets to track the requirements; the spreadsheet format ensured the requirements were relatively small and targeted (and hopefully unambiguous) line items instead of prose paragraphs describing things. I even wrote a way, using annotations and the Google SOAP API, to create some simple HTML reports about what requirements had tests. It was pretty clear that was a sub-optimal solution, but it was a start.
The question really became where to go with things: if we wanted to cover our whole application this way and really drive a lot of our automated end-to-end testing off of it, we’d need everyone on the team to be on board with it, and doing that would require a much better tool for managing things. Thankfully we had an engineer who was fairly amazing at coming up with little tools to solve all sorts of development problems, and he agreed to take the lead on formalizing things and later driving adoption of the tool. The end result was basically an addition to MediaWiki that we called the “requirements wiki,” eventually nicknamed “the Riki.” The modification added some special tags for listing requirements, which are assigned unique IDs when the page is first updated. It also allows you to tag requirements with labels like “agreed” and “implemented,” along with several other clever things. The IDs can then be used as annotations on test methods to tie those methods back to the requirements, and the Riki has a background process that periodically takes a build and processes all the annotations to link things up. The result is the ability to display the current test methods inline with the requirements, along with coverage reports showing both what percentage of requirements have any tests at all and what percentage of the tests have actually been implemented. The latter statistic allows us to add in empty test methods as a way of sketching out a test plan without implementing it all immediately.
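To give a feel for the test-side half of that linkage, here’s a rough sketch; the annotation and the requirement IDs are invented for illustration rather than being our exact API:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Invented stand-in for the Riki's test annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Requirement {
    String[] value(); // the unique IDs the wiki assigned to the requirements
}

class VehiclePageTest {

    @Requirement("REQ-1041") // "The 'Add Vehicle' button takes you to the 'New Vehicle' page"
    public void testAddVehicleGoesToNewVehiclePage() {
        // ... end-to-end test body ...
    }

    @Requirement("REQ-1042") // "'Clone Vehicle' clones all fields except the VIN"
    public void testCloneVehicleClonesEverythingButVin() {
        // Deliberately empty: the Riki counts annotated-but-unimplemented tests
        // separately, so empty methods like this one sketch out a test plan.
    }
}
```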
The Riki is still pretty young, so the jury is still out on our ability to really keep it up to date. So far, though, it’s proven useful as a way to coordinate dev, QA, and PM by giving everyone a shared, authoritative reference point about how things are supposed to work. I’m hopeful that by making it an indispensable part of our development process we’ll manage to overcome the inherent problems with keeping documentation up to date, and that it’ll drive clearer, less ambiguous requirements, better testing, and better communication between dev, QA, and PM, and serve as an ongoing reference for anyone new to the team or to a particular area.
The common view of engineering is that you’re building something up: successively adding parts, features, layers, etc. to eventually achieve some sort of functional goal. That general view holds whether you’re building a bridge or a piece of software.
There’s another way to look at engineering, though, and that’s as the art of failure avoidance. When building a bridge, you have to anticipate all the ways that the bridge could fail and plan around them: worst-case weight loads, earthquakes and high winds, extreme heat or cold, etc. A good bridge, in the structural sense, is one that manages to avoid all the possible failure modes. You might think of this as the Anna Karenina rule for engineering: all successful projects are alike (they don’t fail), while all unsuccessful projects fail in their own unique ways.
Good software engineering is the same in that respect, with the difference that software can generally fail in far more ways than a physical structure can thanks to the magic of things like concurrency, state, and combinatorial explosions of possible inputs. So what does it mean to engineer to avoid failure rather than just to build features? Here are a few rules I try to keep in mind.
Test Like You’re Trying To Break It
Engineers just tend not to do the best job at this: you write the feature to do X, you test that it does X, and you move on. But the bugs that approach catches are just the easy ones; the nastier bugs are the result of simply not thinking about certain scenarios or combinations of cases. Sure, the feature does X if you use it as intended, but what if you try to abuse it? What if you deliberately put in invalid input, try to use it at inappropriate times, or otherwise violate the implicit assumptions about when and how the feature will be used? One of the benefits of test-driven development is that it forces you to at least consider those cases more. In the end, though, one set of eyes generally isn’t good enough, which is why pair programming and pair testing can help, and why you still need some amount of dedicated QA time even if your engineers write all the tests they can think of. But as an engineer, the more you can really try to break the code you’ve just written, the better your tests will be.
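To make that concrete, here’s a tiny, entirely invented example. The interesting tests aren’t the ones that show quote() works on a valid policy; they’re the ones that feed it input it was never supposed to see:

```java
import org.junit.Test;
import static org.junit.Assert.fail;

// Entirely invented example: a tiny rater plus tests that abuse it rather
// than exercise the happy path.
public class QuoteAbuseTest {

    static class Policy { int vehicleCount; }

    static class Rater {
        double quote(Policy p) {
            if (p == null) throw new IllegalArgumentException("policy required");
            if (p.vehicleCount == 0) throw new IllegalArgumentException("no vehicles to rate");
            return 100.0 * p.vehicleCount; // stand-in for real rating logic
        }
    }

    @Test
    public void quoteRejectsNullPolicyExplicitly() {
        try {
            new Rater().quote(null);
            fail("should reject a null policy up front, not NPE somewhere deeper");
        } catch (IllegalArgumentException expected) {
            // the benign failure mode we want
        }
    }

    @Test
    public void quoteRejectsPolicyWithNoVehicles() {
        try {
            new Rater().quote(new Policy()); // violates an implicit assumption
            fail("zero vehicles should be rejected, not quoted as $0 or garbage");
        } catch (IllegalArgumentException expected) {
        }
    }
}
```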
No Wishful Thinking
One of the biggest sources of software failure, in my opinion, is simply that software engineers always want to believe their software works and, given a lack of immediate hard evidence to the contrary, will tend to believe that things will work. Optimism in general is good for your morale, but when it crosses over into wishful thinking you get into trouble (e.g. “I’m confident in our ability to deliver on the features we’ve agreed upon” becomes “We haven’t really tested X yet, but I’m pretty sure it’ll work,” or “We haven’t really tested that kind of load, but I’m pretty sure we can handle it,” or “The schedule looks tight, but I think we can make it”). It helps to have at least one surly, grumpy pessimist on your team to provide a check against most people’s natural tendency to assume things will work out okay. The general rule is that if you haven’t demonstrated that it’ll work, it probably won’t.
Think About The Worst Case
Another classic engineering failure mode is to consider only the expected or average case and not to plan for or test the worst case. For example, if you’re displaying a list of items, how many items will that list have on average versus at the 95th percentile or in the absolute worst case? If the list might have 30 items on average but could have 10,000 in the worst case, you’re going to need to design your software so that it performs at least acceptably in the worst case, even if that case doesn’t appear very often. It’s easy to conflate “how often X happens” with “how much work I should put in to handle X,” but in reality you need to make sure you handle those 1% cases gracefully (which doesn’t necessarily mean “optimally”), even if that doubles the effort.
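For the list example, “acceptable in the worst case” might look something like the following sketch; the page size is an arbitrary number picked for illustration:

```java
import java.util.List;

// Sketch: render a list that averages ~30 items but can hit 10,000.
public class ItemListRenderer {

    private static final int PAGE_SIZE = 100; // arbitrary cap, for illustration

    // Render only one page at a time; never materialize all 10,000 rows in the UI.
    public List<Item> visibleItems(List<Item> all, int page) {
        int from = Math.min(page * PAGE_SIZE, all.size());
        int to = Math.min(from + PAGE_SIZE, all.size());
        return all.subList(from, to); // a bounded view: graceful, not necessarily optimal
    }

    static class Item { }
}
```

Capping what gets rendered is only one option, of course; the point is just that the worst case gets an explicit, deliberate answer instead of being left to chance.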
Optimize For Debugging And Bug-Fixing
No matter what you do, your software is going to be buggy (well, maybe not if you’re Don Knuth, but it will be for us mere mortals); hopefully your testing procedures let you find those problems before they go into production, but an important part of failure avoidance is fixing them when they come up. There are really two sides to that coin: tracking the problem down and fixing it. Tracking the problem down generally requires the right set of tools and the right sort of code organization; well-structured code is easier to debug, and explicit code, even if it’s verbose, tends to be much easier to debug than implicit or declarative code where “magic” happens in some incredibly general way that’s hard to put a breakpoint on.
Being able to fix bugs requires something of the same approach. As a company that ships highly configurable products, though, we have an extra set of issues that come up when you ship frameworks and tools. If too much is built into the framework in a way that isn’t controllable by the person writing code on top of the framework, you can end up without a bail-out when bugs do occur. So as much as declarative, implicit programming can be useful, it’s usually good to have an imperative, explicit bail-out when necessary to allow working around shortcomings in the underlying platform.
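A common shape for that kind of bail-out, sketched here with invented names, is a declarative default paired with an imperative override:

```java
// Sketch of a declarative default with an imperative escape hatch; all of
// these names are invented for illustration.
public class FieldValidator {

    // The declarative path: most fields just supply a pattern in configuration.
    public boolean isValid(String value, String configuredPattern) {
        return value != null && value.matches(configuredPattern);
    }
}

// When the declarative rule can't express what's needed, or has a bug the
// customer can't wait on, client code overrides it with plain imperative logic.
class VinValidator extends FieldValidator {
    @Override
    public boolean isValid(String value, String ignoredPattern) {
        // Explicit code: easy to step through in a debugger, easy to patch.
        // (A real VIN check is more involved; this is just a sketch.)
        return value != null && value.length() == 17;
    }
}
```

The declarative path stays convenient for the common case, but when the configured rule can’t express what’s needed (or is just broken), there’s explicit code to fall back on and to put a breakpoint in.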
Consider The Failure Modes
Related to the previous point’s observation that failure is, in some sense, inevitable, it’s important to consider what the failure modes actually are and to design the system in such a way that they’re as benign as possible. For example, if an automated process can make the right decision 90% of the time, ideally you’d like to identify the 10% of the cases where the system can’t figure out the right thing with 100% certainty and kick that out to a user to make the final call. If you’re writing a security framework, you need to consider if you want the inevitable mis-configuration to fail open (such that anyone can access things) or fail closed (such that no one can). In the aforementioned case of a huge list, perhaps you can get away with simply capping the number of entries that can be displayed, or perhaps one page being glacially slow 1% of the time is fine provided you can keep it from slowing the entire server or database to a crawl. In other words, you don’t have to write the perfect system, but you should write one that avoids doing anything you’re not sure is correct, at least in cases where correctness matters.
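To pick on the security example a bit more, here’s a minimal fail-closed sketch (the names are invented): when the configuration is missing or malformed, the system denies access rather than silently allowing it.

```java
import java.util.Map;

// Sketch of a fail-closed permission check; names invented for illustration.
public class PermissionChecker {

    private final Map<String, Boolean> grants; // loaded from configuration

    public PermissionChecker(Map<String, Boolean> grants) {
        this.grants = grants;
    }

    public boolean mayAccess(String user, String resource) {
        Boolean grant = grants.get(user + ":" + resource);
        // A missing or malformed entry denies access (fail closed)
        // rather than silently allowing it (fail open).
        return grant != null && grant;
    }
}
```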
If You Can’t Get It Right, Don’t Do It
Lastly, some features just weren’t meant to be built. They’re too complex, too ambiguous, or otherwise too hard to get right. Discretion is, as they say, the better part of valor, and it’s important to know the limits of your tools, your schedule, and your team. It’s almost always preferable to have 50 features that are 100% correct than 100 features that are 50% correct.
Zero is kind of a magic number as far as I’m concerned; it’s the one number that’s never open to debate and never involves a slippery slope.
One of the best things we’ve done with our development process was the introduction, a couple of years ago, of a root “stable” branch that (in theory) is always kept at zero test breaks. Prior to that, all development across all teams happened essentially in the root branch, meaning that destabilizing checkins from the platform team were competing with the PC or CC team’s attempts to get a preview release out to a customer. The resulting flood of checkins meant that there were always a lot of test breaks at any given point in time, and it wasn’t always clear whose responsibility they were (even though our harness attempts to assign them to individuals). Getting all the breaks fixed generally involved temporarily halting code checkins for all developers, and the constantly high level of breaks made the incentive to fix any given break much lower.
We eventually switched to smaller child “active” branches living off of a root “stable” branch for the release, with the branches managed by the individual teams and by smaller groups within those teams as appropriate. Those branches are synced down from stable regularly and pushed back to the stable branch only when they’re at zero test breaks, thus maintaining the zero-test-break rule. That change has had two positive effects on test breaks and fixes. First of all, since the stable branch is kept (in theory) at zero test breaks, any breaks within a branch are unambiguously the responsibility of that group, which greatly reduces the diffusion of responsibility that occurs when 50 developers share a branch. Secondly, the number of breaks in any given branch stays lower, preventing major changes from masking smaller breaks and increasing the visibility of any given test break, which encourages people to fix their tests in a more timely manner and generally keeps zero test breaks in sight at all times.
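The invariant itself is simple enough to state as code; this is just a sketch with invented names, not our actual branch tooling:

```java
// Sketch of the branch-promotion rule; invented names, not our real tooling.
public class BranchGate {

    interface Branch {
        int brokenTestCount();
        void syncDownFrom(Branch parent); // pull stable's changes regularly
        void pushTo(Branch parent);       // merge the active branch up into stable
    }

    // The whole policy: pushes to stable are allowed at exactly zero breaks.
    public void promote(Branch active, Branch stable) {
        active.syncDownFrom(stable);
        if (active.brokenTestCount() == 0) {
            active.pushTo(stable);
        } else {
            throw new IllegalStateException(
                active.brokenTestCount() + " test breaks; fix them before pushing to stable");
        }
    }
}
```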
The benefit of having zero test breaks in the stable branch pays off in the stability of the code; the application teams get to work with more stable versions of the platform code instead of mid-change versions since the platform code has been pulled from the stable branch, the platform team can verify their changes against working application code instead of half-working code, and anyone can take a build for any product off of the stable branch and assume that it’ll basically work as advertised (though it might be a week or two out of date).
Of course, there are still some kinks in the system. Sometimes you can’t fix all the tests before you push your code; maybe you’ve changed things so fundamentally that half the tests will need to be rewritten, and that will take months. Or perhaps someone has just written a test for something that was already broken but untested previously, and you don’t have time to fix it right away. In those cases, we’ll often comment out tests or, via annotations that our test harness understands, mark a test as “disabled” (i.e. don’t run it) or a “known break” (i.e. we know it’s broken, there’s a bug filed, and someone’s on it, so don’t really count it as broken). On top of that, we still have non-deterministic tests that crop up sometimes, as well as environment-specific problems that won’t be caught or won’t even show up until code is pushed to the stable branch. All of those problems are frustrating, and it’s painful to have to open up backdoors to the “zero broken tests” rule, but by tweaking the definition of a “broken test” a bit we can at least still come up with a hard, unambiguous rule.
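In test code, those escape hatches look something like the following sketch; the annotation names are approximations for illustration, not our harness’s exact API:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Approximate sketch of the harness's escape hatches; names are invented.
@Retention(RetentionPolicy.RUNTIME)
@interface Disabled {
    String reason();
}

@Retention(RetentionPolicy.RUNTIME)
@interface KnownBreak {
    String bug(); // a filed bug that someone owns
}

class RenewalTest {

    @Disabled(reason = "half the renewal tests need rewriting after the rearchitecture")
    public void testRenewalPremiumProration() { /* not run by the harness */ }

    @KnownBreak(bug = "PC-1234") // runs, but doesn't count against the zero-break rule
    public void testRenewalOfExpiredPolicy() { /* ... */ }
}
```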
A hard rule like that works, of course, because zero itself is so unambiguous. If you say “we can’t have more than 10 broken tests,” then it’s a slippery slope: what if five of them are really critical breaks? What if there are 12 broken tests? If I only have one broken test assigned to me, maybe it’s okay if I just leave it there and assume other people will fix theirs to get us down to the right number. Zero has none of those problems; it’s unambiguous, it doesn’t let anyone off the hook, and it’s impossible to argue with.
In the next development cycle for PolicyCenter, I’m hoping to introduce a similarly unambiguous policy with regard to the bugs that are filed as we develop, though we’ll see how that goes. Ideally the only acceptable number of open bugs for the release will be zero at any given point in time: every bug will either be deferred to a future release, reclassified as an “improvement,” marked as permanently “won’t fix,” in the process of being fixed, or waiting only because its developer is busy fixing some other bug first. Rules like “fix important bugs now” or “don’t leave too many bugs for the end” are simply too vague and open to interpretation; zero is pretty much the one number that no one gets to argue with.
I’ve been thinking a lot lately, and having lots of discussions, about what our development process should be for the next release of PolicyCenter. I’ve taken the potentially-controversial position that date targets and deadlines are detrimental to developing a software project, and it’s a position that I feel many non-developers don’t really understand.
The intuitive position that I’m arguing against is what I think of as the term paper philosophy: having the deadline there forces you to get it together and crank through the thing, whereas without any firm deadline you’d keep working on it forever (or keep putting off working on it), and you wouldn’t really work very hard on it either. Some people need that deadline pressure and the rush of adrenaline that accompanies the fear of not getting things done, but even people who don’t thrive on it often benefit from having a deadline to focus their efforts.
What I’m saying is that that theory doesn’t apply very well to long-lived software projects. It might apply to toy projects in college that you never work on again, but the two fundamental differences between software and a term paper (or nearly anything else you try to build on a deadline) are that you keep having to work on the software project and there are far more corners to cut in software development.
It’s hard to over-emphasize the significance of that statement, and I think that non-developers really don’t have a very good analog of what that means. There just aren’t that many other fields or endeavors in which something is built up successively over years or even decades of work. There are even fewer where you can release an intermediate product off of that tree periodically that people will actually use (and expect support and upgrades for), and fewer still where you can “successfully” cut as many corners as you can in software development without having things completely fall apart in the short term. The only analogy I can think of might be construction work, where if you rush the foundations of a building then the future stories won’t fare so well, but it might not be obvious that’s the case until you start on those future stories or until an earthquake or some other disaster hits. But most sorts of projects or tasks that people do are kind of one-time things; if you keep using them, you use them as they were when they were finished rather than attempting to build more and bigger things on top of your initial work. If you rush building a chair, maybe the chair’s a little off-balance, but it’s not going to affect future chairs that you build. And of course, that’s all combined with the fact that software estimation for anything other than the immediate task at hand is basically impossible, meaning it’s hard to avoid deadline pressure by just estimating accurately, since you’re always trying to estimate something that’s fundamentally pretty unknowable.
What sorts of corners can you cut that will make the product look “done” but will come back to haunt you later? Here are a few examples:
- No or incomplete tests
- Ignoring or not considering edge cases or error conditions
- Ignoring or not considering feature interactions
- Poor UI design
- Poor data modeling
- Poor API design
- Improper encapsulation, decomposition, or otherwise messy code
- Inconsistency between different areas of the code
- No, poor, or inaccurate documentation and specs
All of those things will inevitably bite you later down the line; some of them will bite your customers too, some will make future changes far more difficult or nearly impossible, some will cause your development effort to grind to a halt in the future, and others will be fixable but will require far more work later than doing them right initially would have. Nearly all of them will eventually slow down development, putting even more pressure on future deadlines and leading to a vicious cycle where even more short-term hacks enter the system.
So let’s imagine a real-world scenario. You’ve got what you think is 3 months’ worth of work to do, but one month in, less than 1/3 of it (as far as you can tell) is done. What do you think will happen? People will feel behind, and they’ll pull whatever strings they can to try to make up the lost time, either consciously or unconsciously. The same thing happens with shorter-term deadlines: if you ask someone to get something done by the end of the week, they’ll find a way to try to make it happen, even if they should really take more time. In the long run, you’re going to pay dearly for those sorts of decisions.
So what’s the solution? I’m honestly not sure, though I have some theories I want to try out. There are two obvious problems with not having deadlines at all. First, some people really do need deadline pressure to get things done, and others will simply go on working on less important things forever without it. My personal opinion is that those issues can be dealt with without formal deadlines, by constantly re-estimating and re-prioritizing work instead. The second problem is harder to escape, at least if you’re selling your products to people: everyone you’re selling to expects you to tell them what you’ll have done and when it will be done, so it’s inescapable that you’ll end up making some level of date-based feature commitments, and those commitments have the potential to put deadline pressure on your team and cause them to cut corners. Some types of agile methodologies can hopefully mitigate those problems for at least part of your development cycle, but most of them work better for consulting-style projects that actually have an end date at some point, and I honestly don’t feel like this is a solved problem yet for our type of development work. As a company, we do a very good job of (and take immense pride in) standing behind our commitments to our customers, but we can definitely do a better job internally of mitigating the pressure that such commitments put on the development team. I’m hopeful we’ll be trying some new things out in the next release cycle; whatever happens with those experiments, hopefully I’ll get a chance to write about them and explain what worked, what didn’t, and why.