How Much Downtime Is Acceptable?
When you think about monitoring your website or application performance, how much downtime is acceptable to you? The easy answer is that you want to be up all the time, but that isn’t always a realistic goal. And is it even necessary?
If your company isn’t global, for instance, would it matter to your users if a mission-critical application that is used only by employees during office hours went down in the middle of the night?
It’s a question raised in a Register article last month, “Who the hell cares about five 9s anymore?” It’s a good point, too, because, as author Danny Bradbury pointed out, business users don’t care about meaningless statistics, or at least they care about them only to the extent that they affect their ability to complete their work. In other words, if you were up only 80 percent of the time, but that 20 percent of downtime happened when it didn’t affect most business users, they probably wouldn’t give a rat’s patootie about it, and why should they?
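To put those percentages in concrete terms, here is a quick sketch that converts an availability figure into the downtime it allows over a year. It assumes a 365-day year (ignoring leap years), and the function name is ours, purely for illustration.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the yearly downtime, in minutes, allowed by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

# Compare a few common targets, including five 9s and the 80 percent example.
for pct in (80.0, 99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct:>7}% up -> {minutes:,.1f} min down per year ({minutes / 60 / 24:.2f} days)")
```

Five 9s works out to roughly 5.26 minutes of downtime a year, while 80 percent allows 73 days. The raw number says nothing about *when* those minutes fall, which is exactly Bradbury’s point.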
We are all motivated by our ability to get our work done. When Google Docs went down briefly last week after I had shut down for the day, I didn’t care much because it didn’t affect me. If I had been in the middle of a post when it went down, though, you can be sure I would have been hopping mad, and that’s the difference.
But as the Register article points out, monitoring is an increasingly complex exercise. It’s not just about a single application like Google Docs, or even a single user base. It often involves a complex web of applications, some of which depend on others to keep working. If one piece fails, the whole application is hosed, and end users aren’t going to give you points for having most of the pieces up if the whole application is down.
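A rough way to see why “most of the pieces are up” isn’t good enough: if an application needs every component in a chain to work, its availability is roughly the product of the components’ availabilities. The sketch below assumes the components fail independently (a simplification), and the component names and figures are hypothetical.

```python
from math import prod

def serial_availability(component_availabilities: list[float]) -> float:
    """Availability of a system that is up only when every component is up.

    Assumes independent failures, which real systems often violate.
    """
    return prod(component_availabilities)

# Hypothetical application: web front end, app server, database, auth service.
components = {"web": 0.999, "app": 0.999, "db": 0.9995, "auth": 0.998}
overall = serial_availability(list(components.values()))
print(f"Overall availability: {overall:.4%}")
```

Here four components, each at 99.8 percent or better, combine to about 99.55 percent overall, lower than any single piece on its own.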
What’s more, the complexity increases when you factor in legacy applications built long before you got there, plus a tangle of monitoring tools: some designed to monitor a specific set of tasks, sitting alongside more general-purpose ones.
Faced with that kind of complexity, it’s not hard to see why pinning down the nature of a problem gets harder: there are simply so many monitoring tools and applications to sift through.
Regardless, the real question might not be how much you’re up, but how much you’re down: how much those outages affect your end-user community, how well your monitoring tools give you insight into the nature of the problem, and how quickly you get back up.
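One way to measure how much outages actually affect users is to count only the downtime that overlaps the hours people are working. The sketch below assumes hypothetical business hours of 9:00 to 17:00, Monday through Friday, uses naive datetimes (no time zones or holidays), and runs on made-up outage data.

```python
from datetime import datetime, timedelta

# Hypothetical business hours: 9:00-17:00, Monday-Friday (an assumption).
OPEN_HOUR, CLOSE_HOUR = 9, 17

def business_hours_overlap(start: datetime, end: datetime) -> timedelta:
    """Return how much of the outage [start, end) falls inside business hours."""
    total = timedelta()
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            window_start = day.replace(hour=OPEN_HOUR)
            window_end = day.replace(hour=CLOSE_HOUR)
            overlap_start = max(start, window_start)
            overlap_end = min(end, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total

# Made-up outages for illustration.
outages = [
    (datetime(2024, 3, 5, 2, 0), datetime(2024, 3, 5, 4, 0)),     # Tuesday night
    (datetime(2024, 3, 5, 16, 30), datetime(2024, 3, 5, 18, 0)),  # Tuesday afternoon
    (datetime(2024, 3, 9, 10, 0), datetime(2024, 3, 9, 12, 0)),   # Saturday
]
raw = sum((end - start for start, end in outages), timedelta())
impacting = sum((business_hours_overlap(s, e) for s, e in outages), timedelta())
print(f"Raw downtime: {raw}, during business hours: {impacting}")
```

Five and a half hours of raw downtime shrink to 30 minutes that users would actually notice, which is a very different story from the headline uptime figure.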
Otherwise, those uptime numbers are nothing more than meaningless hype.