Reliability of Cloud-based applications
Nobody knows. But there’s plenty you can do about it.
In the wake of recent outages at Amazon and Salesforce whose consequences cascaded to scores of other companies and sites, it’s instinctive and largely healthy to wonder who’s to blame, what mistakes were made, exactly how reliable The Cloud is, whether your own business should build, buy, or rent, and so on.
Not all these questions have easy, correct answers, though. The problem isn’t just that journalists haven’t asked the right people yet; it’s that we’re talking about systems so new and complex that no one on the planet has yet thought them through completely. Every day, organizations like Amazon and Google set new records for the volumes of data they successfully manage, and one of the few definite conclusions from their trajectories so far comes from Google’s first vice president of engineering: “At scale, everything breaks.”
My point is not to discourage you; in fact, I intend exactly the opposite. Be realistic, understand clearly what is knowable, and make the best of your own situation. There’s still plenty of room for success, even if some of the elements are uncertain.
A plan for thinking about the cloud
One thing that’s clear is that The Cloud will continue to grow. Organizations have their hands full with their own specialties, and the advantages in passing on daily responsibilities for provisioning and peering and patching and everything else The Cloud does are just too compelling. “Utility computing”, or infrastructure-as-a-service (IaaS), remains young; it will improve and, even more to the point, it will become normal.
The second crucial “knowable” is your own situation. You have the advantage over everyone else in the marketplace in comprehending your unique requirements. Our boutique consultancy, Phaseit, Inc., is a customer of Amazon Web Services, Google, and nearly a dozen other IaaS providers. Some of our clients would barely blink if their Web sites were off-line for a few days; some calculate losses in the millions from being out of action for more than a few minutes. Clarity about these differences allows us to craft cost-effective solutions for each one. Don’t expect to reduce all Cloud-vs.-in-house questions to a single measurement; do start, though, with an explicit analysis of what your organization needs from its Cloud computing.
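One way to start that explicit analysis is with back-of-the-envelope arithmetic. The sketch below is only an illustration, not a method any provider or client described here uses: the function names, the availability figures, and the dollar amounts are all hypothetical, and it assumes outage cost grows linearly with downtime, which real businesses often violate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for an availability fraction (e.g. 0.999)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_outage_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly loss, assuming (hypothetically) a flat cost per hour of outage."""
    return annual_downtime_hours(availability) * cost_per_hour


def upgrade_pays_off(current: float, improved: float,
                     cost_per_hour: float, extra_annual_spend: float) -> bool:
    """True if the expected savings from better availability exceed the extra spend."""
    savings = (annual_outage_cost(current, cost_per_hour)
               - annual_outage_cost(improved, cost_per_hour))
    return savings > extra_annual_spend


if __name__ == "__main__":
    # Hypothetical clients: one loses $100 per hour off-line, another $1,000,000.
    for cost_per_hour in (100.0, 1_000_000.0):
        worth_it = upgrade_pays_off(0.99, 0.999, cost_per_hour,
                                    extra_annual_spend=50_000.0)
        print(f"${cost_per_hour:,.0f}/hour: upgrade worthwhile? {worth_it}")
```

With these made-up numbers, the same upgrade is a waste of money for the first client and an obvious bargain for the second, which is the reason no single measurement settles the question for everyone.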
Recognize in your analysis that you can manage, minimize, or mitigate risk, but not eliminate it. While “technical fixes” are wonderful when they work, they have limits. Part of the good engineering your organization needs is to recognize what lies beyond the bounds of engineering. If nothing else, rely on honesty, or its 21st-century transform, “transparency”: there might come a time when you simply need to tell your customers that your plans didn’t allow for a backhoe accident, a truckers’ strike, an atypical hurricane, and an influenza epidemic all happening in the same week.
Managing computing systems never gives much opportunity for rest, simply because change is so rapid; even when you figure out the right answer today, technical advances can change everything by tomorrow. What you can do, however, especially when deciding how much of your own business to push into The Cloud, is to think clearly about your unique situation and requirements, design solutions that are right for you, understand clearly how complex systems fail, and have good recovery plans in place. Every one of these investments is “multi-purpose”: each pays off whatever you decide about The Cloud and whatever you experience in it.