A tester’s view

As this is my first post on this blog, let me introduce myself and explain why I joined Guidewire. First, this blog had a lot to do with it – it gave me a pretty good idea of what sort of folks I would be working with and made me excited enough to apply. Historically, most of this blog has focused on production code development. My intention is to shed some light on our testing adventures.

Most of my professional experience has been as a developer. Initially, I built business apps, but more recently I gravitated towards building tools for both developers and testers. I was lucky to get into Agile development early and had a chance to work with many of the pioneers in the field. I especially fell in love with test-driven development and still maintain the http://testdriven.com site.

One of the tools I built was a framework for developing acceptance tests in Java. I had the great fortune to present it at the Austin Workshop on Test Automation (http://www.pettichord.com/awta5.html), which is where I was first introduced to Exploratory Testing (http://en.wikipedia.org/wiki/Exploratory_testing). Guidewire turns out to be a great place to practice exploratory testing because of the inherent complexity of the applications, the fact that Guidewire has been Agile from day one, and the copious unit and acceptance tests that the developers write. Guidewire products combine core business functionality and a full-fledged development environment, so the testing needs are diverse and challenging on both the business and technical axes.

Now you may wonder: Why would a professional developer want to focus on exploratory testing? – isn’t it manual testing, which is considered the simplest and least well compensated of testing disciplines? Hopefully, if you read the wiki article above, this prejudice will begin to dissolve for you as it did for me. It’s true that I do a fair amount of manual testing, but it’s what we call “brain-engaged testing”, and I only do manually what is either not worth automating or impossible to automate. If anyone has automated “thinking”, please contact me privately so we can make billions off this discovery. 🙂

As a team, we do automate many of the data generation, data collection, and data processing tasks that support exploratory testing. Of course, we do all sorts of testing at Guidewire. We collect metrics and perform root cause analysis to tune our testing mix based on actual bug discovery data. One of the promising directions for us is high-volume, pseudo-random testing. We are reusing some of the tools and tests developed as part of UI-driven acceptance testing to accomplish this. I hope to share some of the results of this experiment in the near future. Bye for now.


The Birth of the Requirements Wiki

As a development organization we’re by no means perfect, though we’re constantly looking for ways to improve, and one of the ways in which we’ve historically had a lot of room for improvement is around internal documentation of requirements and feature specifications.  We’ve come up with what we hope will be a much better long-term solution, but before I describe what that solution is, I’d like to rewind and tell the story of how we ended up where we are right now.

Imagine that your company is writing a policy administration system (such a randomly chosen example, I know), and you’d like to know the answer to the question “How do policy renewals need to work?”  How would you go about trying to answer it?

Historically, we’ve done our requirements documentation and feature specification in a fairly old-school manner:  product management would write up a document describing the requirements for a new feature, development would write up a design topic on the wiki for anything that required more thought and discussion, and development would write up more of a full specification afterwards describing how things actually work.  The end result is that the “requirements” for a given piece of functionality tend to be difficult to discern after the fact:  they’re scattered across a bunch of early-stage PM documents that 1) are generally deltas against each other, 2) don’t always resemble what was actually built, 3) tend to have a lot of ambiguities (since they’re written as normal prose), and 4) don’t capture any of the little things discussed and agreed upon by dev, PM, and QA over the course of actually building a feature.  The development specs (when they’re actually up to date) tend to describe how things actually are rather than what the underlying business requirements are, are also written as deltas, are written at a semi-arbitrary level of detail, and aren’t written for things like UI functionality, while the design topics serve more to explain why things were implemented as they were.  So if you want to know how policy renewals work, your best bet is just to ask someone; the information is so scattered, out of date, and incomplete that it’s impossible to piece it back together.

We do a lot of test-driven development, so you might ask “But what about the tests?”  The agile philosophy is that the tests can often serve as the documentation, and that’s kind of true of well-written, complete unit tests (I still don’t think that’s 100% true, but that’s a different argument).  But the problem is that unit tests themselves are at too low a level to be useful for answering higher-level questions like “How does a policy renewal work?”  Even questions like “What happens when I click the ‘Add Vehicle’ button?” are difficult to document via tests because they require an entirely different level of tests than “unit” tests.  They require end-to-end tests, and those tests tend to be harder to write and harder to read; they’re also much more difficult to ensure completeness for, since you can’t measure test coverage using a tool or even match up the set of methods against your set of tests.  In addition, for infrastructure work the tests tend to help describe the implementation, not the high level requirements.

The other problem with using tests is, unfortunately, that they tend to get deleted when they break too badly; at some point it’s inevitable that some refactoring or other major change will break enough unit tests that you just don’t have the energy or inclination to fix them all right away, so you rewrite and fix what you can and just comment out or delete the ones you can’t.  That’s also true of tests that are written against the actual implementation rather than at some higher level of abstraction; if you change the implementation, all those tests are simply irrelevant, so you have no choice but to kill them.  That might not be ideal, but practically speaking that’s what actually happens in the real world where real people write real tests, and as such tests are a bit shaky to rely on as the sole source of documentation about business requirements.

“What about story cards?” you might ask.  Well, one unfortunate fact is that the policy team that I work on hasn’t used story cards in the past (we will be using them in the next release cycle).  One of our other teams does drive everything off of story cards, but even then I think there are some problems.  First of all, stories are inherently deltas, and over the course of a release or over many releases the same functionality is often continuously changed, making it difficult to piece together an answer to “How does policy renewal work?” because doing so requires assembling all the stories relating to renewals over the course of several releases in chronological order so that the appropriate deltas are applied in order.  Ouch.  Story cards are also inherently somewhat unorganized and can contain information relating to multiple different parts of the system, so just assembling that set of cards in the first place can be difficult.  Story cards would still be light-years ahead of where we were a year ago, so perhaps if we’d had them we wouldn’t have built the tools that we did, but since we didn’t have those cards we had to find a different way to do things.

So that was our situation a year ago:  information about how things were supposed to work was largely in people’s heads, and we had scattered, generally untargeted end-to-end test coverage that touched many parts of the system.

That’s around the time we started to rework some major portions of our application, and before we started we thought the main risk we ran was that we’d break things without realizing it.  In order to mitigate that, we wanted to fill out all (or at least a good number) of the tests for a given area of the application before we changed things.  But how would we know we had “all” the tests and weren’t missing something?  Without any obvious “units” to test we’d have no chance, so we decided to make our own units.  They weren’t really stories in the traditional sense:  they were statements like “The ‘Add Vehicle’ button takes you to the ‘New Vehicle’ page” and “The ‘Clone Vehicle’ button clones all selected vehicles, cloning all of their fields except for the VIN number.”  Some of them could have been stories in the story card sense, but plenty of them were too fine-grained for story cards.  For lack of a better term, we decided to call them “requirements” instead.  Our process then became that we’d first attempt to reverse engineer the requirements for a page before we rewrote it, generally by reading any existing documentation and then by playing around with the page to see what it actually did.  After we wrote those down, we’d try to have them reviewed by the product managers for accuracy and completeness, and then we’d use the requirements to drive a set of tests around the page.  Ensuring a sufficient level of testing became much easier, because we could target the tests to the requirements just as you’d target unit tests to a method.  Once we were done we were pretty sure we’d catch most of the breaks we might introduce, and we’d go ahead with whatever refactoring/rearchitecting needed to happen.
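To make that concrete, here’s a rough sketch of what targeting tests at those requirement line items might look like. The VehiclesPageDriver helper and its methods are hypothetical stand-ins for our real end-to-end test infrastructure; the point is simply the one-test-per-requirement mapping, which makes it easy to see whether the requirements for a page are covered.

import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Requirement-targeted tests for the vehicles page. VehiclesPageDriver is a
// hypothetical helper standing in for whatever drives the real UI end to end.
public class VehiclesPageRequirementsTest {

    // Requirement: "The 'Add Vehicle' button takes you to the 'New Vehicle' page."
    @Test
    public void addVehicleButtonTakesYouToNewVehiclePage() {
        VehiclesPageDriver page = VehiclesPageDriver.open();
        page.clickAddVehicle();
        assertEquals("New Vehicle", page.currentPageTitle());
    }

    // Requirement: "The 'Clone Vehicle' button clones all selected vehicles,
    // cloning all of their fields except for the VIN."
    @Test
    public void cloneVehicleCopiesEveryFieldExceptTheVin() {
        VehiclesPageDriver page = VehiclesPageDriver.openWithVehicles(2);
        page.selectAllVehicles();
        page.clickCloneVehicle();
        assertEquals(4, page.vehicleCount());
        page.assertClonesMatchOriginalsExceptField("VIN");
    }
}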

The idea was the right one, I think, but we had questions over how exactly we’d manage the requirements docs.  What format would they be in?  How would we organize them so people could find them?  Hardest of all, how would we ensure they stayed up to date?  We really wanted to measure coverage of the requirements as well, so how would we do that?  To get the ball rolling we started out just using Google spreadsheets to track the requirements; the spreadsheet format ensured the requirements were relatively small and targeted (and hopefully unambiguous) line items instead of prose paragraphs describing things.  I even wrote a way, using annotations and the Google SOAP API, to create some simple HTML reports about what requirements had tests.  It was pretty clear that was a sub-optimal solution, but it was a start.

The question really became where to go with things:  if we wanted to try to cover our whole application this way and really drive a lot of our automated end-to-end testing off of it, we’d need everyone on the team to be on board with it, and doing that would probably require a much better tool for managing things.  Thankfully we had an engineer who was fairly amazing at coming up with little tools to solve all sorts of development problems, and he agreed to take the lead on formalizing things and later driving adoption of the tool.  The end result was basically an addition to MediaWiki that we called the “requirements wiki” and eventually nicknamed “the Riki”.  The modification adds some special tags for listing requirements, which (on a first update) assign them unique IDs.  It also allows you to tag requirements with labels like “agreed” and “implemented,” along with several other clever things.  The IDs can then be used as annotations for test methods to tie the test methods back to the requirements, and the Riki has a background process that periodically takes a build and processes all the annotations to link things up, resulting in the ability to display the current test methods inline with the requirements, as well as coverage reports showing what percentage of requirements have any tests at all and what percentage of the tests have actually been implemented.  The latter statistic allows us to add in empty test methods as a way of sketching out a test plan without implementing them all immediately.
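To give a feel for the test-method side of that linkage, here’s a sketch along the lines of the targeted tests above, this time with the requirement IDs attached. The annotation name and the REQ-#### ID format are invented for illustration; the real Riki integration works along these lines but differs in its details.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.junit.Test;

public class VehiclesPageRikiTests {

    // Retained at runtime so a background process can pull the IDs off a build
    // via reflection and link each test method back to its wiki requirement.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Requirement {
        String[] value(); // one or more requirement IDs assigned by the riki
    }

    // Implemented test: counts toward both coverage statistics.
    @Requirement("REQ-1234")
    @Test
    public void addVehicleButtonTakesYouToNewVehiclePage() {
        // ... end-to-end test body driving the actual page ...
    }

    // Empty test: sketches the plan, so the requirement shows up as having a
    // test while the test itself is reported as not yet implemented.
    @Requirement("REQ-1235")
    @Test
    public void cloneVehicleClonesAllFieldsExceptTheVin() {
        // TODO: implement
    }
}

A background job that reflects over the built test classes can then match those IDs against the wiki entries, which is what produces the inline test listings and the two coverage percentages.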

The Riki is still pretty young in its life, so the jury is still out on our ability to really keep it up to date.  So far, though, it’s proven useful as a way to coordinate dev, QA, and PM by giving everyone a shared, authoritative reference point about how things are supposed to work.  I’m hopeful that by making it an indispensable part of our development process we’ll manage to overcome the inherent problems with keeping documentation up to date, and that it’ll drive clearer, less ambiguous requirements, better testing, and better communication between dev, QA, and PM, and that it’ll serve as an ongoing reference for anyone new to the team or to a particular area.


Software Is Not A Term Paper

I’ve been thinking a lot lately, and having lots of discussions, about what our development process should be for the next release of PolicyCenter. I’ve taken the potentially controversial position that date targets and deadlines are detrimental to developing a software project, and it’s a position that I feel many non-developers don’t really understand.

The intuitive position that I’m arguing against is what I think of as the term paper philosophy: having the deadline there forces you to get it together and crank through the thing, whereas without any firm deadline you’d keep working on it forever (or keep putting off working on it), and moreover you wouldn’t really work very hard on it either. Some people need that deadline pressure and the rush of adrenaline that accompanies the fear of not getting things done, but even people who don’t thrive on it often benefit from having a deadline to focus their efforts.

What I’m saying is that that theory doesn’t apply very well to long-lived software projects. It might apply to toy projects in college that you never work on again, but the two fundamental differences between software and a term paper (or nearly anything else you try to build on a deadline) are that you keep having to work on the software project and there are far more corners to cut in software development.

It’s hard to over-emphasize the significance of that statement, and I think that non-developers really don’t have a very good analog of what that means. There just aren’t that many other fields or endeavors in which something is built up successively over years or even decades of work. There are even fewer where you can release an intermediate product off of that tree periodically that people will actually use (and expect support and upgrades for), and fewer still where you can “successfully” cut as many corners as you can in software development without having things completely fall apart in the short term. The only analogy I can think of might be construction work, where if you rush the foundations of a building then the future stories won’t fare so well, but it might not be obvious that’s the case until you start on those future stories or until an earthquake or some other disaster hits. But most sorts of projects or tasks that people do are kind of one-time things; if you keep using them, you use them as they were when they were finished rather than attempting to build more and bigger things on top of your initial work. If you rush building a chair, maybe the chair’s a little off-balance, but it’s not going to affect future chairs that you build. And of course, that’s all combined with the fact that software estimation for anything other than the immediate task at hand is basically impossible, meaning it’s hard to avoid deadline pressure by just estimating accurately, since you’re always trying to estimate something that’s fundamentally pretty unknowable.

What sorts of corners can you cut that will make the product look “done” but will come back to haunt you later? Here are a few examples:

  • No or incomplete tests
  • Ignoring or not considering edge cases or error conditions
  • Ignoring or not considering feature interactions
  • Poor UI design
  • Poor data modeling
  • Poor API design
  • Improper encapsulation, decomposition, or otherwise messy code
  • Inconsistency between different areas of the code
  • No, poor, or inaccurate documentation and specs

All of those things will inevitably bite you later down the line; some of them will bite your customers too, some of them will make future changes nearly impossible or far more difficult, some will cause your development effort to grind to a halt in the future, and others will be fixable but will take more work to fix later than it would have taken to do them right initially. Nearly all of them will eventually slow down development, putting even more pressure on future deadlines and leading to a vicious cycle where even more short-term hacks enter the system.

So let’s imagine a real-world scenario. You’ve got what you think is three months’ worth of work to do, but one month in, less than a third of it (as far as you can tell) is done. What do you think will happen? People will feel behind, and they’ll pull whatever strings they can to try to make up the lost time, either consciously or unconsciously. The same thing will happen with more short-term deadlines: if you ask someone to try to get something done by the end of the week, they’ll find a way to try to make it happen, even if they should really take more time. In the long run, you’re going to pay dearly for those sorts of decisions.

So what’s the solution? I’m honestly not sure, though I have some theories I want to try out. There are two obvious problems with not having deadlines at all. First of all, some people really do need deadline pressure to get things done, and without it others will simply go on working on less important things forever. My personal opinion is that those issues can be dealt with without formal deadlines, by constantly re-estimating and re-prioritizing work instead. The second problem is harder to escape, at least if you’re selling your products to people: everyone you’re selling to expects you to tell them what you’ll have done and when it will be done, so it’s inescapable that you’ll end up making some level of date-based feature commitments that then have the potential to put deadline pressure on your team and cause them to cut corners. Some types of agile methodologies can hopefully mitigate those problems for at least part of your development cycle, but most of those work better for consulting-style projects that actually have an end date at some point, and I honestly don’t feel like it’s a solved problem yet for our type of development work. As a company, we do a very good job of (and take immense pride in) standing behind our commitments to our customers, but we can definitely do a better job internally of mitigating the pressure that such commitments put on the development team. I’m hopeful we’ll be trying some new things out in the next release cycle; either way, I’ll hopefully get a chance to write about those experiments and explain whether they worked and why.


A Typical Developer Day

Here’s a typical day from the perspective of a BillingCenter developer (link for LOLcat fans). Hopefully, it demonstrates some of my favorite things about the environment at Guidewire: emphasis on tests, Agile-style feedback, code branching, and an appetite for humility:

9:00: As I do each day before my commute, I pull code from the Stable branch to my team’s Active branch via a shell script. Every team has at least one Active branch on which they submit code changes and manage tests, and code from each of these Active branches is pushed to the Stable branch once a week. The Stable branch is our zero-test-break version of source code and a crown jewel of our company, with which we can release code to customers on a weekly basis… few non-Web2.0 companies can do this.

9:30-12:00: Since we completed our 2.0 feature cycle, we’ve shifted focus to refining existing functionality before we officially ship a more polished product. I am currently working on a story for refactoring the APIs for determining commission rates. This one comes from our Services professionals, who are a great driver of feature refinement. Reading through their correspondence, it looks like they became confused by the half-dozen methods named getCommissionRate() and weren’t sure which one did what. This occasionally happens in large software projects, but frankly it’s embarrassing and definitely needs refactoring. I spend the rest of the morning thinking through the best way to organize a clean, centralized replacement.

11:00-11:10: My full team stands for our daily sprint meeting, during which we each briefly say what we did the previous day and what we plan to do today. At 10 devs, 3 PMs, 6 QA engineers, and 1 documenter, my team has doubled since 1.0 development.

12:30-5:30: Having consulted with our Services people and my team lead, I’ve decided to push all commission-rate-related logic into a configurable GScript method. This means that in addition to removing most of the Java implementations of getCommissionRate(), I repackage the logic from those Java methods into a file that is externally readable and writable, so that Services and customers will never again wonder which version of getCommissionRate() does what.
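To sketch the shape of that change (with invented class, method, and parameter names, and in Java rather than the GScript the real version uses), the idea is to leave a single getCommissionRate() entry point and push the actual rate decision behind one externally replaceable rule:

import java.math.BigDecimal;

// A rough sketch of the consolidation, with invented names throughout.
public final class CommissionRates {

    /** The single place commission-rate logic is allowed to live. */
    public interface CommissionRateRule {
        BigDecimal rateFor(String producerCode, String chargePattern);
    }

    // Default rule; Services or customers swap in their own via setRule().
    private static CommissionRateRule rule =
        (producerCode, chargePattern) -> new BigDecimal("0.10");

    /** Configuration hook standing in for the externally editable GScript file. */
    public static void setRule(CommissionRateRule customRule) {
        rule = customRule;
    }

    /** The one getCommissionRate() that replaces the old half-dozen. */
    public static BigDecimal getCommissionRate(String producerCode, String chargePattern) {
        return rule.rateFor(producerCode, chargePattern);
    }
}

In the real refactor the pluggable piece is an externally editable GScript file rather than a Java interface, but the payoff is the same: there is exactly one getCommissionRate() left to find.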

Normally, QA is involved with the story from start to finish since there are tests to write before we consider ourselves done with a feature, but in the case of this refactor there are no new tests to write – thanks to TDD (test-driven development) and QA for the existing test coverage.

5:30: I learn from our test harness, which has just completed a cycle, that I’ve broken a few tests with my most recent code check-in. From the snapshot stack traces, the fix looks fairly easy. Usually, I would quickly submit a fix before finishing my work day, but it’s time for our weekly Thirsty Thursday social hour. This is always a good time to catch up with coworkers and discuss what’s new with, say, GScript or Studio (our internally developed IDE for GScript and webpage editing) – we tend to keep talking about software and Guidewire even over beers.