To except is human; to handle is divine.
If you write code, you’re writing bugs. It’s a simple fact of life, and the sooner a developer accepts it, the sooner he or she can work towards mitigating the damage that mistakes and errors inevitably carry with them.
After many more years in this industry than I care to admit, it still surprises me how little attention developers pay to error management—and, let’s be clear, I don’t mean preventing errors, but managing them.
Errors as Opportunities
When an error occurs, the vast majority of the web-based application code that I see during my reviews performs the software equivalent of running around with its head cut off: the developer spends an inordinate amount of time and resources trying to make the software look like what was essentially a catastrophic failure was nothing more than a small temporary hiccup. Enter Error-500 pages that attempt to redirect, provide excuses and, generally, cover up the fact that something went horribly, horribly wrong.
In reality, by the time an error has occurred, there are only two possible outcomes: either you expected the error to occur, in which case you have already written code to handle the failure, or you didn’t, in which case your main focus should be to use the error as a learning opportunity.
Doing the Right Thing
When I write web code, if an unexpected error occurs, I actually want the software to fail. The reasoning is simple: if an abnormal condition occurs in the application, I don’t want any of its possible side effects to carry over into other areas of the program and potentially cause even more problems. Thus, the time I spend actually handling the error is very short: I redirect the user to a simple page that communicates the fact that a temporary error condition has occurred. I don’t even bother trying to recover, and I certainly don’t advocate that the user try again, because the last thing that I want is more damage.
On the other hand, I spend a considerable amount of time collecting information. Logging an error is not nearly enough: you need to be able to examine as many of the conditions that surrounded its occurrence as possible. Generally, I collect information about:
- All the local and global variables present at the time of the crash
- The entire HTTP environment—GET and POST data, cookies, Apache environment variables, and so forth
- Breadcrumbs that indicate how the user came upon the particular document that caused the error
- Some critical information on the overall health of the hardware—machine load, memory usage, and so forth
The idea is that the more information you have on hand, the more likely you will be to understand how the error occurred and, therefore, either fix the bug that caused it or write code to turn it from an unexpected error condition to an expected one.
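As a rough illustration of the list above, here is a minimal sketch in Python. The function name `collect_error_context` and the `request_env` and `breadcrumbs` parameters are hypothetical, chosen only for this example; a real application would take them from its own framework. The sketch records local variables per stack frame, but it does not capture globals, and it reports only the machine's load average.

```python
import os
import traceback
from datetime import datetime, timezone


def collect_error_context(exc, request_env=None, breadcrumbs=None):
    """Build a dictionary describing the conditions surrounding an exception."""
    frames = []
    # Walk the traceback and record the local variables of every frame.
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        frames.append({
            "function": frame.f_code.co_name,
            "file": frame.f_code.co_filename,
            "line": lineno,
            # Store repr() strings so the report holds plain text only.
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })

    # Machine health: os.getloadavg() exists only on Unix-like systems.
    try:
        load = os.getloadavg()
    except (AttributeError, OSError):
        load = None

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error": {"type": type(exc).__name__, "message": str(exc)},
        "frames": frames,
        "http": dict(request_env or {}),        # GET/POST data, cookies, server variables
        "breadcrumbs": list(breadcrumbs or []),  # pages visited before the failure
        "health": {"load_average": load},
    }


def apply_discount(price, percent):
    # Deliberately fragile: divides by zero when percent is 100.
    return price / (100 - percent)


try:
    apply_discount(50, 100)
except ZeroDivisionError as exc:
    report = collect_error_context(
        exc,
        request_env={"REQUEST_METHOD": "GET"},
        breadcrumbs=["/cart", "/checkout"],
    )
```

The resulting `report` can then be written to whatever log or ticketing system the team already uses; the point is only that the context is gathered at the moment of failure, not reconstructed afterwards.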
Timeliness is Close to Godliness
In addition to collecting information about an error, you also need to make sure that the IT support team gets notified of it within the appropriate timeframe. At my company, when we write software we often categorize exceptions according to a number of different traits, one of which is criticality. An exception with a low criticality level, for example, will generate little more than an entry in our logging system, whereas a critical exception (such as a database that suddenly fails to respond) warrants an escalating series of actions, like sending an e-mail to the IT folks or, in more extreme cases, even paging someone.
It pays, when you’re designing the architecture for a system, to introduce the concept of criticality early on, so that you get to decide how important any given class of errors is based on the specific constraints of your business requirements, as opposed to some abstract concept of ideal error reporting functionality. Items that might be critical to your particular business—like the ability to perform fraud checks on an e-commerce site that consistently manages high-risk transactions—may warrant actions that for others may not be as important.
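One way to picture criticality-driven escalation is the sketch below. Everything in it is an assumption made for illustration: the three levels, the class names `AppError` and `DatabaseUnavailable`, and the `handlers` mapping are not taken from any particular system. It also assumes that an exception carrying no criticality at all is treated as critical, which is a policy choice that may not suit every business.

```python
from enum import IntEnum


class Criticality(IntEnum):
    LOW = 1
    HIGH = 2
    CRITICAL = 3


class AppError(Exception):
    """Base class for expected application errors (hypothetical)."""
    criticality = Criticality.LOW


class DatabaseUnavailable(AppError):
    """Example of an error that should wake somebody up."""
    criticality = Criticality.CRITICAL


def actions_for(level):
    """Return the escalating list of actions for a criticality level."""
    actions = ["log"]
    if level >= Criticality.HIGH:
        actions.append("email")
    if level >= Criticality.CRITICAL:
        actions.append("page")
    return actions


def notify(exc, handlers):
    """Run every action the exception's criticality calls for.

    `handlers` maps an action name to a callable that takes the exception.
    Exceptions without a `criticality` attribute are assumed to be critical.
    """
    level = getattr(exc, "criticality", Criticality.CRITICAL)
    taken = actions_for(level)
    for action in taken:
        handlers[action](exc)
    return taken
```

Keeping the level on the exception class means the decision about how important a class of errors is gets made once, at design time, rather than at every place the error might be caught.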
Break Before You Fix
Fixing an error as soon as you become aware of it can be a costly mistake, and a difficult one to understand from a business perspective. When the world comes crashing down on your company in the form of a broken website, the last thing that management and investors are going to want to hear is “well, let us figure out what’s wrong first,” but, realistically, that’s exactly what you should be saying.
Far too often, I see developers rushing into fixing a bug without considering all the possible consequences of their changes—thus making a strong case for doing the exact opposite!
A bug fix should always be preceded by a reproduction test. It’s not until you’ve managed to consistently cause a bug to rear its ugly head that you have really understood what causes it. A proper test provides you with an easy, quick and consistent way to reproduce a bug and make sure that you have effectively squashed it. More importantly, a test “is forever”: it becomes a useful regression tool so that the next time you fix a bug you have one more way to ensure that your changes do not have any undesired side effects.
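To make the idea concrete, here is a small sketch using Python's built-in `unittest` module. The function `parse_quantity` and the bug it once had (crashing on an empty form field) are invented for this example. The first test is the reproduction: it would have failed against the original code, and it passes once the fix is in place.

```python
import unittest


def parse_quantity(raw):
    """Hypothetical function under repair: turn form input into an integer."""
    # The fix: treat empty or whitespace-only input as zero instead of
    # letting int("") raise ValueError.
    text = raw.strip()
    return int(text) if text else 0


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_empty_input_does_not_crash(self):
        # Reproduces the original failure: an empty field raised ValueError.
        self.assertEqual(parse_quantity(""), 0)

    def test_whitespace_only_input(self):
        self.assertEqual(parse_quantity("   "), 0)

    def test_normal_input_still_works(self):
        # Guards against the fix breaking the common case.
        self.assertEqual(parse_quantity(" 3 "), 3)
```

Saved in a file, these tests can be run with `python -m unittest`, and they stay in the suite afterwards as a regression check for every later change.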
Of course, in order to write a test, you need some sort of test harness, which is a deceptively simple concept that a lot of developers (and managers) still seem to have a big problem coming to terms with. You don’t have to be a believer in TDD to appreciate the advantage of a good testing platform—and today you have access to so many that are easy to use, powerful and inexpensive that it really doesn’t make sense for your project not to have one, no matter how small its size or simple its function.
If you’re interested in more on handling errors, Jeff Atwood at Coding Horror has recently published an interesting piece on the subject.
Comments
The interesting blog you link to itself links to elmah, a nice-looking .NET error reporting mechanism.
I am just setting out to develop a complex webapp and I am taking the opportunity to tightly build in error logging for the first time (or for the first time properly, I mean – not as an afterthought)
Pear::Log seems to do what I want, but I cannot see it handling crashes without having something like Xdebug on the live server.
What is the PHP equivalent of elmah?
Seen any articles about Exception-driven development?
PaulG – take a peek at Zend Server. The Monitoring capabilities & integration with Zend Studio for error capture & root cause analysis could be what you're looking for.
http://www.zend.com/server/
Caveat: I work at Zend.
Mr. G, that link only works without a trailing slash
I take the same approach to error/exception handling. The default state of any app should be failure. Also agree that a developer should never band-aid a bug but find the underlying issue.
As for how I handle management when there is a bug, I revert everything back to the last stable version and then branch the buggy branch to dev for fixing. Granted, the only downside to that is if the latest revision has hot new features, they’re gone until things are resolved…