Engineering As Failure Avoidance
Posted: July 25, 2008
The common view of engineering is that you’re building something up: successively adding parts, features, layers, etc. to eventually achieve some sort of functional goal. That general view holds whether you’re building a bridge or a piece of software.
There’s another way to look at engineering, though, and that’s as the art of failure avoidance. When building a bridge, you have to anticipate all the ways that the bridge could fail and plan around them: worst-case weight loads, earthquakes and high winds, extreme heat or cold, etc. A good bridge, in the structural sense, is one that manages to avoid all the possible failure modes. You might think of this as the Anna Karenina rule for engineering: all successful projects are alike (they don’t fail), while all unsuccessful projects fail in their own unique ways.
Good software engineering is the same in that respect, with the difference that software can generally fail in far more ways than a physical structure can thanks to the magic of things like concurrency, state, and combinatorial explosions of possible inputs. So what does it mean to engineer to avoid failure rather than just to build features? Here are a few rules I try to keep in mind.
Test Like You’re Trying To Break It
Engineers tend not to do the best job at this: you wrote the feature to do X, you test that it does X, and you move on. But the bugs that approach catches are the easy ones; the nastier bugs come from simply not thinking about certain scenarios or combinations of cases. Sure, it does X if you use the feature as intended, but what if you try to abuse it? What if you deliberately put in invalid input, or try to use it at inappropriate times, or otherwise violate any implicit assumptions about when and how the feature will be used? One of the benefits of test-driven development is that it forces you to at least consider those cases more. In the end, though, one set of eyes generally isn’t good enough, which is why pair programming/testing can help and why you still need some amount of dedicated QA time even if your engineers write all the tests they can think of. But as an engineer, the more you can really try to break the code you’ve just written, the better your tests will be.
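As a rough illustration of the difference, here is a minimal Python sketch. The `parse_quantity` function, its 10,000 upper limit, and the list of hostile inputs are all hypothetical, invented for this example; the point is only that the "does it do X" check is one line, while the "try to break it" checks are where most of the thought goes.

```python
def parse_quantity(text):
    """Hypothetical feature: parse a positive item count from user input."""
    if not isinstance(text, str):
        raise TypeError("quantity must be a string")
    stripped = text.strip()
    # Accept plain ASCII digits only: no signs, decimals, or exponents.
    if not (stripped.isascii() and stripped.isdigit()):
        raise ValueError(f"not a whole number: {text!r}")
    value = int(stripped)
    if value == 0 or value > 10_000:  # assumed business limit
        raise ValueError(f"quantity out of range: {value}")
    return value


def rejects(bad_input):
    """Return True if parse_quantity refuses the input with a clean error."""
    try:
        parse_quantity(bad_input)
    except (TypeError, ValueError):
        return True
    return False


# The "does it do X" test: the feature used as intended.
assert parse_quantity("3") == 3

# The "try to break it" tests: empty, negative, zero, fractional,
# exponent notation, over the limit, injected junk, and wrong types.
hostile_inputs = ["", "   ", "-1", "0", "3.5", "1e9", "10001",
                  "3; DROP TABLE items", None, 7]
for bad in hostile_inputs:
    assert rejects(bad), f"accepted hostile input: {bad!r}"
```

A list like `hostile_inputs` is never complete, which is the argument for a second set of eyes: someone else will think of the case you didn't.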
No Wishful Thinking
One of the biggest sources of software failure, in my opinion, is simply that software engineers want to believe their software works and, absent immediate hard evidence to the contrary, will tend to assume that it does. Optimism in general is good for your morale, but when it crosses over into wishful thinking you get into trouble (e.g. “I’m confident in our ability to deliver on the features we’ve agreed upon” becomes “We haven’t really tested X yet, but I’m pretty sure it’ll work” or “We haven’t really tested that kind of load but I’m pretty sure we can handle it” or “The schedule looks tight but I think we can make it”). It helps to have at least one surly, grumpy pessimist on your team to provide a check against most people’s natural tendency to assume things will work out okay. The rule is generally that if you haven’t demonstrated that it’ll work, it probably won’t.
Think About The Worst Case
Another classic engineering failure mode is to only consider the expected or average case and not to plan for or test out the worst case. For example, if you’re displaying a list of items, how many items will that list have on average versus the 95th percentile or absolute worst cases? If the list might have 30 items on average but could have 10000 in a worst case, you’re going to need to design your software such that it performs at least acceptably under the worst case, even if it doesn’t appear that often. It’s easy to conflate “how often X happens” with “how much work I should put in to handle X” but in reality you need to make sure you handle those 1% cases gracefully (which doesn’t necessarily mean “optimally”) even if that doubles the effort.
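To make the average-versus-worst-case gap concrete, here is a small Python sketch. The `size_profile` helper and the sample data are assumptions made up for this example (99 lists of 30 items and one list of 10,000, echoing the numbers above); it uses a simple nearest-rank percentile rather than any particular library's definition.

```python
import statistics


def size_profile(sizes):
    """Summarize observed list sizes: mean, 95th percentile, and worst case."""
    ordered = sorted(sizes)
    # Nearest-rank 95th percentile, computed with integer arithmetic
    # (ceiling of 0.95 * n) to avoid floating-point surprises.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical production sample: almost every list is small, one is huge.
observed = [30] * 99 + [10_000]
profile = size_profile(observed)

# Neither the mean nor the 95th percentile hints at the 10,000-item list;
# only looking at the maximum reveals what the design has to survive.
print(profile)
```

If the performance test only ever loads the "typical" 30 items, the one customer with 10,000 is the one who finds the problem.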
Optimize For Debugging And Bug-Fixing
No matter what you do, your software is going to be buggy (well, maybe not if you’re Don Knuth, but the rest of us are mere mortals). Hopefully your testing procedures allow you to find those problems before they go into production, but an important part of failure avoidance is fixing them when they come up. There are really two sides to that coin: tracking the problem down and fixing the problem. Tracking the problem down generally requires the right set of tools and the right sort of code organization; well-structured code is easier to debug, and explicit code, even if it’s verbose, tends to be much easier to debug than implicit or declarative code where “magic” happens in some incredibly general way that’s hard to put a breakpoint on.
Being able to fix bugs requires something of the same approach. As a company that ships highly configurable products, though, we have an extra set of issues that come up when you ship frameworks and tools. If too much is built into the framework in a way that isn’t controllable by the person writing code on top of the framework, you can end up without a bail-out when bugs do occur. So as much as declarative, implicit programming can be useful, it’s usually good to have an imperative, explicit bail-out when necessary to allow working around shortcomings in the underlying platform.
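One way to picture such a bail-out, as a sketch only: the Python below imagines a tiny declarative field renderer where each field is described by a dictionary. The `render_field` function and the `name`, `label`, and `render` keys are hypothetical and do not correspond to any real framework; the idea is simply that the declarative path is the default, and an explicit function can replace it when the default turns out to be wrong.

```python
def render_field(spec, record):
    """Render one field of a record from a declarative spec.

    If the spec supplies a "render" callable, it takes over completely.
    That is the imperative escape hatch for cases the framework's
    default behavior gets wrong.
    """
    override = spec.get("render")
    if override is not None:
        return override(record)
    # Default, declarative behavior: "<label>: <value>".
    value = record.get(spec["name"], "")
    return f"{spec['label']}: {value}"


record = {"price": 5}

# Purely declarative use: the framework decides how to format the value.
plain = render_field({"name": "price", "label": "Price"}, record)

# Bail-out: the caller works around the default formatting without
# waiting for a framework fix.
custom = render_field(
    {"name": "price", "label": "Price",
     "render": lambda r: f"${r['price']:.2f}"},
    record,
)
```

The override is also a convenient place to put a breakpoint, which ties back to the debugging point above.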
Consider The Failure Modes
Related to the previous point’s observation that failure is, in some sense, inevitable, it’s important to consider what the failure modes actually are and to design the system in such a way that they’re as benign as possible. For example, if an automated process can make the right decision 90% of the time, ideally you’d like to identify the 10% of the cases where the system can’t figure out the right thing with 100% certainty and kick that out to a user to make the final call. If you’re writing a security framework, you need to consider if you want the inevitable mis-configuration to fail open (such that anyone can access things) or fail closed (such that no one can). In the aforementioned case of a huge list, perhaps you can get away with simply capping the number of entries that can be displayed, or perhaps one page being glacially slow 1% of the time is fine provided you can keep it from slowing the entire server or database to a crawl. In other words, you don’t have to write the perfect system, but you should write one that avoids doing anything you’re not sure is correct, at least in cases where correctness matters.
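Two of these choices can be sketched in a few lines of Python. Everything here is hypothetical: the `decide` and `is_allowed` functions, the 0.9 threshold, and the shape of the permissions data are assumptions for illustration, not a real security or workflow API.

```python
REVIEW_THRESHOLD = 0.9  # assumed cutoff; a real system would tune this


def decide(prediction, confidence):
    """Accept the automated decision only when the system is confident.

    Anything below the threshold is kicked out for a person to review
    instead of being guessed at.
    """
    if confidence >= REVIEW_THRESHOLD:
        return ("auto", prediction)
    return ("needs_review", None)


def is_allowed(permissions, user, resource):
    """Fail closed: missing or malformed configuration denies access."""
    try:
        return resource in permissions[user]
    except (KeyError, TypeError):
        # Unknown user, or a permissions structure that is not what we
        # expect: deny rather than guess.
        return False


# Confident prediction is applied; an uncertain one goes to a human.
confident = decide("approve", 0.97)
uncertain = decide("approve", 0.55)

# A well-formed config grants access; a broken or empty one grants nothing.
granted = is_allowed({"alice": {"reports"}}, "alice", "reports")
denied = is_allowed(None, "alice", "reports")
```

Whether failing closed is the right call depends on the system: locking everyone out is benign for a payroll report and much less so for a hospital door.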
If You Can’t Get It Right, Don’t Do It
Lastly, some features just weren’t meant to be built. They’re too complex, or too ambiguous, or otherwise too hard to get right. Discretion is, as they say, the better part of valor, and it’s important to know the limits of your tools, schedule, and team abilities. It’s almost always preferable to have 50 features that are 100% correct than 100 features that are 50% correct.