As I mentioned in my previous post, the PolicyCenter team is moving to a stricter agile process in the next release, with a new focus on short-term iterations, up-front feature agreement, and getting things “done done.” In doing that, though, we’re going to run into a lot of problems that traditionally derail agile practices; they’re the sort of things the agile literature will tell you not to do because they make things difficult, but that we don’t really have any choice about.
They’re also the sorts of things that, in my experience, tend to make people pretty skeptical about agile’s claims. Reading through the agile literature, you can pretty quickly create a mental picture of the idealized agile project: a small, greenfield, internal development project. Furthermore, there’s often an emphasis in agile on the fact that all the various practices are reinforcing, so you can’t really drop any of them. That often adds up to some serious skepticism from people who aren’t working under those sorts of idealized conditions.
From what I’ve seen and read, there also isn’t a ton of guidance out there for how to apply agile methods in less-than-ideal cases, other than suggestions to try to get closer to the ideal case. Perhaps it’s just that every situation is unique, so maybe there aren’t any generally-applicable rules. To that end, I figured I’d document the sorts of problems we’re running into, and what we’re trying to do to deal with them.
Releasing Software to Multiple Customers
We ship to multiple customers, not just one, which means that we have to use our product managers as customer proxies rather than having actual on-site customers. What we’ve found over the last few years is that on PolicyCenter the features are so complicated and so contentious that we need to put much more work into up-front agreement about what the features are, rather than relying on that to happen within the iteration. Otherwise, the product manager might need several days to check with customers or confer with other PMs to come up with an answer, or they might feel rushed to make a decision that later needs to be revisited. That means working harder to get prototypes, mockups, and PRDs in front of customers earlier, and spending more time vetting possible implementation options with the development team before we even attempt to schedule stories.
Long Release Cycles
An ideal agile project releases frequently, perhaps even every week, and certainly no less often than every few months. Our release cycles vary between about 9 and 15 months. There’s not much we can do about that, however; despite our best efforts to make the product upgradable, upgrade still isn’t trivial for our customers, and once they’re close to or in production they have no desire to upgrade frequently. Frequent releases would therefore just mean more versions we have to support; releasing every 3 months would mean a 4x increase in the number of versions we need to support, which would be absolute suicide. Because of that, we’re stuck releasing relatively infrequently. The clear downsides are the need to make long-term release plans, additional date pressure on those long-term plans (because the next release is a year out), and longer feedback cycles. The best we can do is to try to get better real customer feedback as we go (which is difficult given how much they build on top of our application, meaning we don’t get the feedback until they do that work), and to deal with the other issues as well as we can.
Long-Term Release Commitments
The long-term release commitments that are forced on us by long release cycles are, unfortunately, pretty unavoidable. Customers that are buying the product need to make sure the investment makes sense for them, and that is often highly dependent on what they’re getting and when. There might be certain key features they need in order for the release to be useful, and they’ll need to be able to budget, staff, and plan their projects, which means they’ll need to know when they can start and when they’ll get a finished product. So we just don’t have the luxury of saying, “we’ll work on things in your priority order, so you’ll get that eventually” or “if it’s not in this release, just wait a month or two;” the next release is a year out, and we’re prioritizing across multiple customers anyway instead of just one.
The best we can do there is to combine good old SWAG estimates with as much risk management as we can stomach to try to come up with a plan for what we can commit to, giving ourselves a huge buffer to deal with the inevitable fact that some estimates will be off by 2-3x and that some issues will always come up mid-cycle. I’m hopeful that a consistent team velocity, combined with a release’s worth of data about how our initial feature-level SWAGs correspond to actual story points, will let us do a better job of planning the release after this one; I’ll let everyone know how that works out a year from now.
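To make that arithmetic concrete, here’s a minimal sketch of buffered planning; every number, feature name, and the buffer factor are invented for illustration, not our actual plan:

```python
# Toy illustration of buffered release planning; every number and
# feature name here is invented.
swags = {"feature A": 10, "feature B": 6, "feature C": 8, "feature D": 12}

capacity_weeks = 40      # rough capacity for the release
buffer_factor = 2.0      # assume estimates may be off by about 2x

committed, cost = [], 0.0
for name, estimate in swags.items():    # assume dict is in priority order
    padded = estimate * buffer_factor   # pad each SWAG for likely misses
    if cost + padded <= capacity_weeks:
        committed.append(name)
        cost += padded

print(committed)
```

The point of the padding is that committing only to what fits after the buffer leaves room for the inevitable mid-cycle surprises.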
Large, Complicated Codebase
PolicyCenter is a really large, really complicated product. So large and so complicated, in fact, that no one person can really understand in detail how everything is supposed to work or how it’s implemented. Agile development relies heavily on shared code ownership in order to do things like incremental design and project-wide refactorings, but those get increasingly difficult as the codebase gets larger. It also leads to slower build and test run times, and it necessitates that we have a dedicated QA group that can focus on the high-level feature interactions that developers will end up missing because they just can’t fit the whole picture in their head. Going forward we’re going to try pair-programming more in order to do a better job of spreading knowledge around the team, but the reality is that we’re going to have to have informal code ownership by small subteams.
The other danger is that the product’s design becomes fragmented and the various parts cease to fit together because no one has the big-picture view. Avoiding that is explicitly my job, and I haven’t entirely figured out yet how I’ll manage to keep doing it as the product gets larger; if I come up with anything novel or interesting I’ll certainly write about it. To me, architecture really needs to come out of one or two people’s heads, and those people need to understand the whole system, and that just doesn’t scale up very well.
Managing a large, complicated product requires a lot of engineers, so our team is currently 15 engineers (and growing), and once you add in PM and QA and Docs the overall team size is in the mid-30s. That’s not exactly a small team, and most agile practices geared towards small teams don’t work at that size. To combat that, we’ve split the team up into four different cross-functional “pods,” as I described previously, in an effort to make agile practices work for a subset of the team. There will still need to be cross-pod communication, and people will probably move between pods every so often, but in general we’re optimizing for communication within the pod and using it as the primary unit of organization, work assignment, and velocity tracking. We’ve tried that sort of organization before with mixed results, but we’ve never really made the pods cross-functional or given them this much independence; hopefully that will be the missing ingredient that makes them work well and helps the team scale.
Building A Toolkit
The last major issue is that a large portion of what we produce is a toolkit and API for clients to use in customizing the application. Having a large published interface limits the kinds of refactoring we can do, and forces us to do much more thinking to get the API right up front so that we don’t have to change it. To a lesser degree, that’s true of our database schema as well; it’s got to upgrade from release to release in a reasonable amount of time, so if we screw it up too badly we might never really be able to fix it.
That also means we can’t really commit to full incremental design or architecture; we’ve got to have some idea of where we’re going and know if it’s something we can live with long term, because we have a lot less flexibility to fix things we don’t like in future releases. That’s naturally a difficult thing to do, and it really just requires a lot of skill, good taste, and luck. It also means we’ll definitely make mistakes by attempting to anticipate the wrong things, but we don’t really have the option to not think about the long-term implications of what we’re doing.
When I started working at Guidewire, back in 2002, the company was all of 15 people, maybe 10,000 lines of code, and one unreleased product. No one really knew anything about Test Driven Development or unit testing in general, and we didn’t really have a continuous integration server, but we did attempt to generally follow the scrum model: daily sprint meetings; month-long sprints of development work that combined design and implementation (and testing, such as it was before we had any QA folks); and a backlog of work for the release, organized in priority order, that we’d pull from to plan the next sprint. For a long time, that’s pretty much how we continued to develop: at the start of each sprint people would estimate out what they thought was 20 days of work, and at the end of the sprint we’d see what had actually gotten done, discuss what went well and what we should change, and use that to inform the plan and process for the next sprint.
Eventually we set up an auto-build, started to try things like unit testing, and after several fumbling false starts there we managed to make it a core part of our process and culture. Now, as we’ve mentioned before, we have our own in-house test harness application that manages running 40,000+ automated tests across dozens of branches over a farm of servers.
But somewhere along the line, the scrum process kind of broke down for us, in my opinion. It happened at different points on different teams, and you can point to a lot of factors as the culprit:
- Communication breakdowns as the team got larger
- Increased inaccuracy of estimates as the product(s) got much, much larger and more complicated
- Increased maintenance costs as we increased the number of customers and releases
- Increased maintenance costs in the form of test maintenance
- Poorer estimates, increased complexity, and increased product surface area led to internal date slippages, which put pressure on everyone to scramble to still meet external date commitments, leading to process breakdowns and increased technical debt
There are probably other factors in there that I’m forgetting, but the upshot was probably a pretty classic software development story: the methodology that worked well with 10 developers, tens of thousands of lines of code, one or two customers, and hardly any maintenance releases didn’t work so well with 50 developers, half a million lines of code, more customers, more releases to maintain, and 4-year-old crufty tests that often did more harm than good.
So what do you do about that? Clearly, we needed to change our development methodology somehow, but for a long time we avoided really looking more seriously at anything like XP: after all, we were already doing lots of testing, refactoring, month-long timeboxed iterations, iteration retrospectives, continuous integration, and maintaining a backlog of work that we pulled from each iteration.
Unfortunately, the agile community generally isn’t too helpful about dealing with real-world situations like ours: large teams, large codebases, years of stacked development, legacy unit tests, multiple customers to please, hard release commitments to be met, no real on-site customers, etc. Most of the literature just kind of tells you to change those things: split the team up, simplify your codebase, make your tests faster and more independent, get customers on-site, avoid hard release commitments, etc. Reading the agile literature can be frustrating at times as a result, and it can be easy to read through it and say “that won’t work for us” because, well, as strictly written it won’t. (I’ll expand on those issues in some later posts).
The result, unfortunately, was that we didn’t end up tweaking our process all that much. We did our best to deal with the test and code maintenance issues, and we attempted to split each product team into smaller “pods” within the team to address some of the management and coordination issues with larger teams. As we did it at the time, however, I don’t think it was a particularly successful approach.
And then came the 3.0 release of PolicyCenter, where we rewrote a huge percentage of the application from the inside out (i.e. without changing the end-user behavior all that much) in an attempt to address some major architectural issues that had led to an explosion in complexity, making the product buggy and difficult to work on. That kind of cleanup, however, is inherently a huge unknown: you’re not changing functionality, so you can’t really measure progress in terms of end-user changes; the changes were violent enough that going halfway on any of them wasn’t even close to an option; and they were so drastic that most of our existing tests wouldn’t run or compile anymore, meaning that we had to start over from scratch on a lot of our testing efforts (realizing that we didn’t know what to test, incidentally, led to the creation of the Riki). We attempted to organize into sprints, but the reality was that we had no idea how long things would take, product managers weren’t able to provide much oversight or exert much control, and it was a whole lot of controlled chaos. The fact that we made it out basically on time (as of our revised timeline) with a stable, functional product is a testament to the quality of the team, but there’s no way we could continue to work the way we have for the past year.
Meanwhile, one of Guidewire’s other products, BillingCenter, was being run quite differently from the other application teams: they were using a process much closer to stricter agile methodologies like XP, with two-week iterations, story cards, point-based estimation, and a focus on getting things “done done” before moving on to the next feature. That was working much better for them than our scrum process ever had for the other teams (except, perhaps, back when we were tiny and had hardly any code), so naturally the rest of the teams have moved to adopt that model. Our ClaimCenter team already has, and PolicyCenter will in a few weeks when we start on our next release.
Of course, it’s never that easy, and we’ve got the disadvantages of a larger team than BillingCenter, a more complicated product, a more configurable product, much more disparity between our customers, a much larger long-term desired feature set, and a lot of resulting date pressure (both internal and external) around particular features. Unsurprisingly, those are some of the problems that got the PolicyCenter project in trouble in the first place.
Even so, I’m confident a process change will keep us on the right track and will help to alleviate some of the issues that have killed us in the past. So what, specifically, are we doing differently?
- Cross-functional pods – Our original mistake with pods was to only really include development in them. We’ve attempted to re-arrange our seating several times to include PM and QA in with the developers, which has helped, but we’re now going to more formally create sub-teams that officially include PM, QA, development, and (if we can) docs. We’ll reduce cross-pod communication as much as we can, optimize for high-bandwidth communication within the pods, estimate and assign work at the pod level, and do our best to let the teams have latitude to self-organize and experiment with what works best for them.
- Focus on “done done” – We fell into the classic development trap of leaving too much bug-fixing until the end, creating uncertainty and stress and letting deep-seated architectural issues pile up until far too late in the cycle. In my view, the lack of doneness is largely driven by date pressure: with date-based estimation and long-range release plans, developers always want to hit their estimates, and they’ll (often unconsciously) cut corners and skimp on testing to do it. Making “doneness” an explicit, shared criterion ought to fix that, though it’ll slow down our perceived rate of progress (while increasing our actual rate of progress in the long run).
- More up-front agreement on features prior to development – PolicyCenter functionality is complicated, hard to get right, and contentious, and it requires much more up-front research, experimentation, and debate than normal features do. In the past, we’ve started working on features before those issues were worked out, and the aforementioned date pressure would make people feel they had to build something even though there wasn’t necessarily agreement on what to build. Doing that work in-process, as it were, was often pretty fatal: the product managers would be rushed and the developers would be frustrated or would just make assumptions. We’re focusing now on using stories and doing more up-front work to figure out what to build so that when it comes time to plan an iteration we only schedule work that’s already been fully agreed upon by all parties.
- Shorter iterations and stricter timeboxing – Four weeks just turns out to be too long for people to wait when new issues come up, which means that our timeboxing was never that strict and priorities would be shifted mid-sprint. Our lack of up-front agreement pretty much guaranteed that unexpected issues would crop up, and it ensured that our estimates would be inaccurate and wishful. Moving to two-week iterations with more up-front agreement and more reliable estimates should make it possible to avoid most mid-iteration corrections.
- Tracking velocity rather than estimates – Estimating work in days seems natural, but it’s just a horrible, horrible mistake. Doing it meant that we never corrected when our estimates were skewed by maintenance burdens or just chronic optimism about how fast we could work, and combined with a lack of up-front agreement our estimates were usually fairly inaccurate. The real upshot was that the team couldn’t commit to its estimates, further exacerbating the timeboxing problems, and we never really had a great indication of how fast we were actually going since a lack of “done doneness” threw things off as well.
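As a toy sketch of that last difference (story names and point values invented for illustration), velocity-based planning projects the next iteration’s capacity from what was actually finished in past iterations rather than from anyone’s day estimates:

```python
# Toy illustration of velocity-based planning; all names and point
# values are invented. Velocity is the average points actually
# finished per past iteration, not what anyone estimated up front.
completed_points = [18, 22, 20, 21]

velocity = sum(completed_points) / len(completed_points)   # 20.25

backlog = [("story A", 8), ("story B", 5), ("story C", 5), ("story D", 8)]

plan, total = [], 0
for name, points in backlog:            # backlog is in priority order
    if total + points <= velocity:
        plan.append(name)
        total += points

print(plan)
```

Because the projection comes from measured history, chronic optimism or maintenance drag automatically shows up as a lower velocity instead of silently blowing the estimates.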
That, of course, is merely my hope for how things will work out. We’re starting development of the next release a couple of weeks from now, using that process, and I’m sure we’ll learn plenty and make plenty of tweaks as we go. I’ll do my best to report back on how it actually works out, what difficulties we find, and what we try to do to overcome them.
As a development organization we’re by no means perfect, though we’re constantly looking for ways to improve, and one of the ways in which we’ve historically had a lot of room for improvement is around internal documentation of requirements and feature specifications. We’ve come up with what we hope will be a much better long-term solution, but before I describe what that solution is, I’d like to rewind and tell the story of how we ended up where we are right now.
Imagine that your company is writing a policy administration system (such a randomly chosen example, I know), and you’d like to know the answer to the question “How do policy renewals need to work?” How would you go about trying to answer it?
Historically, we’ve done our requirements documentation and feature specification in a fairly old-school manner: product management would write up a document describing the requirements for a new feature, development would write up a design topic on the wiki for anything that required more thought and discussion, and development would write up a fuller specification afterwards describing how things actually work. The end result is that the “requirements” for a given piece of functionality tend to be difficult to discern after the fact: they’re scattered across a bunch of early-stage PM documents that 1) are deltas against each other, 2) don’t always resemble what was actually built, 3) tend to have a lot of ambiguities (since they’re written as normal prose), and 4) don’t capture any of the little things discussed and agreed upon by dev, PM, and QA over the course of actually building a feature. The development specs (when they’re actually up to date) tend to describe how things actually are rather than what the underlying business requirements are; they’re also written as deltas, at a semi-arbitrary level of detail, and aren’t written for things like UI functionality, while the design topics serve more to explain why things were implemented as they were. So if you want to know how policy renewals work, your best bet is just to ask someone; the information is so scattered, out of date, and incomplete that it’s impossible to piece it back together.
We do a lot of test-driven development, so you might ask “But what about the tests?” The agile philosophy is that the tests can often serve as the documentation, and that’s kind of true of well-written, complete unit tests (I still don’t think that’s 100% true, but that’s a different argument). But the problem is that unit tests themselves are at too low a level to be useful for answering higher-level questions like “How does a policy renewal work?” Even questions like “What happens when I click the ‘Add Vehicle’ button?” are difficult to document via tests because they require an entirely different level of tests than “unit” tests. They require end-to-end tests, and those tests tend to be harder to write and harder to read; they’re also much more difficult to ensure completeness for, since you can’t measure test coverage using a tool or even match up the set of methods against your set of tests. In addition, for infrastructure work the tests tend to help describe the implementation, not the high level requirements.
The other problem with using tests is, unfortunately, that they tend to get deleted when they break too badly; at some point it’s inevitable that some refactoring or other major change will break enough unit tests that you just don’t have the energy or inclination to fix them all right away, so you rewrite and fix what you can and just comment out or delete the ones you can’t. That’s also true of tests that are written against the actual implementation rather than at some higher level of abstraction; if you change the implementation, all those tests are simply irrelevant, so you have no choice but to kill them. That might not be ideal, but practically speaking that’s what actually happens in the real world where real people write real tests, and as such tests are a bit shaky to rely on as the sole source of documentation about business requirements.
“What about story cards?” you might ask. Well, one unfortunate fact is that the policy team I work on hasn’t used story cards in the past (we will be using them in the next release cycle). One of our other teams does drive everything off of story cards, but even then I think there are some problems. First of all, stories are inherently deltas, and over the course of a release or over many releases the same functionality is often continuously changed, making it difficult to piece together an answer to “How does policy renewal work?” because doing so requires assembling all the stories relating to renewals over the course of several releases in chronological order so that the appropriate deltas are applied in order. Ouch. Story cards are also inherently somewhat unorganized and can contain information relating to multiple different parts of the system, so just assembling that set of cards in the first place can be difficult. Story cards would still be light-years ahead of where we were a year ago, so perhaps if we’d had them we wouldn’t have built the tools that we did, but since we didn’t have those cards we had to find a different way to do things.
So that was our situation a year ago: information about how things were supposed to work was largely in people’s heads, and we had scattered, generally untargeted end-to-end test coverage that touched many parts of the system.
That’s around the time we started to rework some major portions of our application, and before we started we thought the main risk we ran was that we’d break things without realizing it. In order to mitigate that, we wanted to fill out all (or at least a good number) of the tests for a given area of the application before we changed things. But how would we know we had “all” the tests and weren’t missing something? Without any obvious “units” to test we’d have no chance, so we decided to make our own units. They weren’t really stories in the traditional sense: they were statements like “The ‘Add Vehicle’ button takes you to the ‘New Vehicle’ page” and “The ‘Clone Vehicle’ button clones all selected vehicles, cloning all of their fields except for the VIN.” Some of them could have been stories in the story card sense, but plenty of them were too fine-grained for story cards. For lack of a better term, we decided to call them “requirements” instead. Our process then became that we’d first attempt to reverse engineer the requirements for a page before we rewrote it, generally by reading any existing documentation and then by playing around with the page to see what it actually did. After we wrote those down, we’d try to have them reviewed by the product managers for accuracy and completeness, and then we’d use the requirements to drive a set of tests around the page. Ensuring a sufficient level of testing became much easier, because we could target the tests to the requirements just as you’d target unit tests to a method. Once we were done, we were pretty sure we’d catch most of the breaks we might introduce, and we’d go ahead with whatever refactoring/rearchitecting needed to happen.
The idea was the right one, I think, but we had questions over how exactly we’d manage the requirements docs. What format would they be in? How would we organize them so people could find them? Hardest of all, how would we ensure they stayed up to date? We really wanted to measure coverage of the requirements as well, so how would we do that? To get the ball rolling we started out just using Google spreadsheets to track the requirements; the spreadsheet format ensured the requirements were relatively small and targeted (and hopefully unambiguous) line items instead of prose paragraphs describing things. I even wrote a way, using annotations and the Google SOAP API, to create some simple HTML reports about what requirements had tests. It was pretty clear that was a sub-optimal solution, but it was a start.
The question really became where to go with things: if we wanted to cover our whole application this way and really drive a lot of our automated end-to-end testing off of it, we’d need everyone on the team to be on board with it, and doing that would probably require a much better tool for managing things. Thankfully we had an engineer who was fairly amazing at coming up with little tools to solve all sorts of development problems, and he agreed to take the lead on formalizing things and later driving adoption of the tool. The end result was basically an addition to MediaWiki that we called the “requirements wiki,” eventually nicknamed “the Riki.” The modification added some special tags for listing requirements, which are assigned unique IDs when the page is first updated. It also allows you to tag requirements with labels like “agreed” and “implemented,” along with several other clever things. The IDs can then be used as annotations on test methods to tie those methods back to the requirements, and the Riki has a background process that periodically takes a build and processes all the annotations to link things up. The result is the ability to display the current test methods inline with the requirements, along with coverage reports showing what percentage of requirements have any tests at all and what percentage of the tests have actually been implemented. The latter statistic allows us to add in empty test methods as a way of sketching out a test plan without implementing it all immediately.
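In spirit, the linkage works something like the following sketch; the decorator name, requirement IDs, and report format here are all invented for illustration, not the Riki’s actual mechanism:

```python
# Hypothetical sketch of tying tests back to requirement IDs, in the
# spirit of the Riki; the decorator, IDs, and report format are invented.

REQUIREMENTS = {
    "REQ-101": "The 'Add Vehicle' button takes you to the 'New Vehicle' page",
    "REQ-102": "The 'Clone Vehicle' button clones all fields except the VIN",
}

def covers(*req_ids):
    """Tag a test function with the requirement IDs it exercises."""
    def decorator(fn):
        fn.covers = req_ids
        return fn
    return decorator

@covers("REQ-101")
def test_add_vehicle_button():
    pass  # ... end-to-end test body would go here ...

@covers("REQ-102")
def test_clone_vehicle_clears_vin():
    pass  # an empty stub still sketches out the test plan

def coverage_report(tests):
    """Which requirements have at least one test tagged to them?"""
    covered = {rid for fn in tests for rid in getattr(fn, "covers", ())}
    return {rid: (rid in covered) for rid in REQUIREMENTS}

print(coverage_report([test_add_vehicle_button, test_clone_vehicle_clears_vin]))
```

A background process scanning those tags is what makes the coverage numbers cheap to keep current, which is the whole trick for keeping requirements and tests honest with each other.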
The Riki is still pretty young, so the jury is still out on our ability to really keep it up to date. So far, though, it’s proven useful as a way to coordinate dev, QA, and PM by giving everyone a shared, authoritative reference point about how things are supposed to work. I’m hopeful that by making it an indispensable part of our development process we’ll manage to overcome the inherent problems with keeping documentation up to date, and that it’ll drive clearer, less ambiguous requirements, better testing, and better communication between dev, QA, and PM, and serve as an ongoing reference for anyone new to the team or to a particular area.
Zero is kind of a magic number as far as I’m concerned; it’s the one number that’s never open to debate and never involves a slippery slope.
One of the best things we’ve done with our development process was the introduction, a couple of years ago, of a root “stable” branch that (in theory) is always kept at zero test breaks. Prior to that, all development across all teams happened essentially in the root branch, meaning that destabilizing checkins from the platform team were competing with the PC or CC team’s attempts to get a preview release out to a customer, and the resulting flood of checkins meant that there were always a lot of test breaks at any given point in time and it wasn’t always clear whose responsibility they were (even though our harness attempts to assign them to individuals). Trying to get all the breaks fixed generally involved temporarily halting code checkins for all developers, and the constantly high level of breaks made the incentive to fix any given break much lower.
We eventually switched to smaller child “active” branches living off of a root “stable” branch for the release, with the branches managed by the individual teams and by smaller groups within those teams as appropriate. Those branches are synced down from stable regularly and pushed back to the stable branch only when they’re at zero test breaks, thus maintaining the zero-test-break rule. That change has had two positive effects on test breaks and fixes. First, since the stable branch is kept (in theory) at zero test breaks, any breaks within a branch are unambiguously the responsibility of that group, which greatly reduces the diffusion of responsibility that occurs when 50 developers share a branch. Second, the number of breaks is kept lower, preventing major changes from masking smaller breaks and increasing the visibility of any given test break, which encourages people to fix their tests in a more timely manner and generally keeps zero test breaks in sight at all times.
The benefit of having zero test breaks in the stable branch pays off in the stability of the code; the application teams get to work with more stable versions of the platform code instead of mid-change versions since the platform code has been pulled from the stable branch, the platform team can verify their changes against working application code instead of half-working code, and anyone can take a build for any product off of the stable branch and assume that it’ll basically work as advertised (though it might be a week or two out of date).
Of course, there are still some kinks in the system. Sometimes you can’t fix all the tests before you push your code; maybe you’ve changed things so fundamentally that half the tests will need to be rewritten, and that will take months. Or perhaps someone has just written a test for something that was already broken but untested previously, and you don’t have time to fix it right away. In those cases, we’ll often comment out tests or, via annotations that our test harness understands, mark a test as “disabled” (i.e. don’t run it) or a “known break” (i.e. we know it’s broken, there’s a bug filed, and someone’s on it, so don’t count it as broken). On top of that, we still have non-deterministic tests that crop up sometimes, or environment-specific problems that won’t be caught or won’t even show up until code is pushed to the stable branch. All of those problems are frustrating, and it’s painful to have to open up backdoors to the “zero broken test” rule, but by tweaking the definition of a “broken test” a bit we can at least still come up with a hard, unambiguous rule.
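A rough sketch of how such markers might work (the decorator names, statuses, and bug ID here are invented for illustration, not our harness’s actual API):

```python
# Illustrative sketch of "disabled" and "known break" markers; the
# decorator names and statuses are invented, not our harness's API.

def disabled(fn):
    fn.status = "disabled"         # never run (or count) this test
    return fn

def known_break(bug_id):
    def decorator(fn):
        fn.status = "known_break"  # broken, bug filed, someone's on it
        fn.bug_id = bug_id
        return fn
    return decorator

def test_quote_totals():
    pass

@known_break("PC-1234")
def test_renewal_dates():
    pass

@disabled
def test_legacy_rating():
    pass

def real_breaks(results):
    """results: (test_fn, passed) pairs; count only active failing tests."""
    return sum(
        1 for fn, passed in results
        if getattr(fn, "status", "active") == "active" and not passed
    )

# The "zero broken tests" rule is then enforced as real_breaks(...) == 0.
print(real_breaks([(test_quote_totals, True),
                   (test_renewal_dates, False),
                   (test_legacy_rating, False)]))
```

The key design point is that the exceptions are explicit and tracked, so the headline number the team enforces can stay at exactly zero.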
And that, of course, is because zero itself is so unambiguous. If you say “we can’t have more than 10 broken tests,” then it’s a slippery slope: what if five are really critical breaks? What if there are 12 broken tests? If I only have one broken test assigned to me, maybe it’s okay if I just leave it there and assume other people will fix theirs to get us down to the right number. Zero has none of those problems; it’s unambiguous, it doesn’t let anyone off the hook, and it’s impossible to argue with.
In the next development cycle for PolicyCenter, I’m hoping to introduce a similarly unambiguous policy for bugs filed as we develop, though we’ll see how that goes. Ideally, the only acceptable number of open bugs for the release will be zero at any given point in time: every bug will either be deferred to a future release, reclassified as an “improvement,” marked permanently as “won’t fix,” in the process of being fixed, or waiting behind some other bug its developer is fixing first. Rules like “fix important bugs now” or “don’t leave too many bugs for the end” are simply too vague and open to interpretation; zero is pretty much the one number that no one gets to argue with.
The goal of quality software development is to release a product that meets or exceeds customer expectations. That sounds simple, but few software companies (especially those developing enterprise applications) ever achieve this goal, and even fewer do so on a consistent basis. From its inception, Guidewire adopted aspects of Agile development and Extreme Programming with the objective of high-quality, on-time, customer-relevant releases. My ambition with this blog entry is to describe the experience of the QA team working in Guidewire’s Agile model, in the hope of passing along lessons learned, how we work, and what we strive for.
There’s no QA in XP
QA and Agile/XP are not natural bedfellows. The issue has been discussed ad nauseam in various forums (see http://www.theserverside.com/news/thread.tss?thread_id=38785 for a relatively entertaining discourse on the matter). In synopsis, Kent Beck (the creator of Extreme Programming) did not see a role for a QA team or for testers in general. In Extreme Programming Explained, Beck stated: “An XP tester is not a separate person, dedicated to breaking the system and humiliating the programmers.” Rather, in Agile/XP, programmers develop extensive unit tests and the end customers decide whether the resulting product (or feature) is acceptable. In fact, the founders of Guidewire debated whether or not to have a QA team at all, based both on the ideals of XP and on negative experiences with the effectiveness of QA teams at prior companies. And yet Guidewire has consistently maintained a 2-to-1 developer-to-QA-Engineer ratio, a very high investment in QA compared to the software industry as a whole. Why the disconnect? Is Guidewire really an Agile/XP company? Is Agile/XP too idealistic?
My 2 cents
My first assessment is that Guidewire does indeed try to live up to Agile/XP goals. My second is that strict adherence to Agile/XP is simply unrealistic. Following are several reasons why Guidewire has required a dedicated QA effort:
- Unit testing, test-first development, and pair programming fall into that category of ideas that almost everyone agrees are worth the investment. However, the intense dedication required to implement XP practices successfully and consistently means that the reality of XP projects often falls short of their ambitions.
- Unit tests generally fail to take into account both the integration between features and the final packaged customer deliverable, along with the various environments the product will operate in. Without an effort to exercise the application in its end-customer state, including against all supported platforms, too many issues will be missed by unit testing alone.
- From a psychological point of view, it’s difficult to objectively critique your own creation (i.e., your own code). An interesting case study would be to contrast the tests a developer implements with the tests defined by a QA Engineer. It’s likely the developer’s unit tests would cover the obvious use cases the code was designed for, but how likely is it that they would exercise the uglier aspects that lurk at the boundaries?
- Automation (specifically unit testing) cannot be applied to every feature or situation. Some amount of manual testing is realistically unavoidable.
- QA Engineers live at a higher level of abstraction than a typical developer and are exposed to a broader view of the product and its public requirements. Considerations that are obvious at the customer level often never reach a developer focused on a specific block of code.
- The customer feedback loop is not always ideal. Agile depends on customers willing to invest heavily in the product development effort, communicating their priorities and whether or not the product being built fits their needs. Many customers are unwilling or ill-equipped to make this investment. It’s arguable whether that’s the best decision on the customer’s part, but the reality is that in Agile you’re asking customers to engage in what amounts to beta testing on steroids. As a result, some intermediary (namely QA) is necessary to determine whether a feature meets its stated requirements.
- Finally, in the world of mission-critical insurance applications it is simply unacceptable to rely on the end customer to discover product issues. Of course some bugs in any release will be caught by customers, but keeping the number of issues to a minimum is key to successful deployments and content customers.
Hmmm. We do need QA. Now, what is QA again?
So, given that QA has proven to be justified at Guidewire, what is its most effective application? First off, I feel a little guilty using “we” when describing the QA team. A separate QA organization truly belongs only in a waterfall development model, where testing is a stand-alone stage and code is delivered wholesale from development to QA. Moreover, a team expressly tasked with ensuring quality inherently defeats the purpose of the Agile/XP process, in which testing and quality should be addressed by everyone in the organization and at every point of development. My belief is that QA in an Agile/XP environment should follow general Agile/XP tenets (when in Rome…): establish and maintain core Agile/XP ideals, especially a passion for continual improvement. Following are some guiding principles that I feel are inherent to any successful QA effort within an Agile/XP environment:
- Never outsource QA. Luckily, this has never been an issue at Guidewire. The fact is that no salary differential (and no communication device) can compensate for the collaboration that occurs when Engineers sit in the same room with no barriers to conversation.
- There is no substitute for a good Engineer. Hiring is key.
- Strive for tightly integrated development and QA teams. Ideally, QA Engineers sit next to their developer counterparts, and the testing effort is shared and occurs in lockstep with code development. The automated testing infrastructure should be shared as well. At Guidewire we have what may be the ultimate automation solution: tests developed by QA are run 24/7 by a harness that assigns broken tests to whoever is responsible for the regression (usually development). Thus a valid automated test is maintained ad infinitum. Simply check in your test and walk away…
- Make holistic quality-related decisions. Involving Development and Product Management in decisions that impact testing resources allows for more effective use of limited time. QA should focus on areas known to be high risk: for example, new features where unit testing is known to be lacking, or code likely to exhibit buggy behavior due to its inherent complexity. Likewise, the likely customer impact of a bug in a given area (knowledge usually unique to PM) is valuable in determining whether that feature deserves special attention.
- Establish a model of continual training and leverage your knowledge base to keep Engineers up to date. At Guidewire we send all QA Engineers through the training courses developed for Field Engineers. The expense is rather large (three weeks of full-time training), but the payoff is Engineers exposed to the entire product and its customer-facing interface. Without this training it would likely take years for each Engineer to attain the same broad base of product knowledge.
- Develop good tests, whether manual or automated. A bad test (one that is redundant, poorly specified, trivial, or, worst of all, misleading) is expensive in terms of maintenance and misplaced confidence.
- Automate, automate, automate. Guidewire strives for 100% automated test coverage. This is a brash goal and oftentimes the reality is far from ideal, but if you don’t shoot for the moon…
- Treat test code as production code. Follow good coding conventions, comment well, and refactor each test such that it remains relevant. This is another example of a pinnacle of testing that is difficult to reach.
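One detail glossed over above is how the harness decides whom to assign a fresh break to. A plausible heuristic, sketched here with invented names rather than the actual ToolsHarness logic, is to blame the author of the most recent commit in the window between the last green run and the first red one:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of break assignment; not the real ToolsHarness algorithm.
class BreakAssigner {
    record Commit(String author, long timestamp) {}

    /**
     * Assign a newly broken test to the author of the latest commit in the
     * suspect window (after the last green run, up to the first red run).
     * Returns empty if no commit landed in that window.
     */
    static Optional<String> assign(List<Commit> commits, long lastGreen, long firstRed) {
        return commits.stream()
                .filter(c -> c.timestamp() > lastGreen && c.timestamp() <= firstRed)
                .max(Comparator.comparingLong(Commit::timestamp))
                .map(Commit::author);
    }
}
```

In practice the suspect window can contain many commits, so a real system also needs a reassignment path for when the first guess is wrong; the sketch only shows the default choice.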
Hopefully this is a decent primer on the often misunderstood and historically maligned area of software development called Quality Assurance, especially as applied to Agile. There are many topics related to quality software development I’d like eventually to delve into or expand upon. Test coverage is a fascinating pseudo-science that’s fun to debate. The evolution of the QA Engineer from key-banging monkey to fully-fledged object-oriented programmer is interesting as well (especially from a staffing point of view, where such QA programmers are as rare as an early Triassic mammal). I would also like to explore whether it might in fact be a healthy goal for a development organization to reach a state where QA is superfluous. In addition, I’d like to cover what is perhaps the greatest challenge of Agile development: scaling what works well on a small team to a much larger organization. Finally, it’s fun for me to reminisce on the history of the Guidewire QA team, from pure manual testing, to a nascent Test Harness and limited test automation, to the world today, where GScript tests and the ToolsHarness infrastructure allow for near-limitless automation potential.