Little Mystery in Amazon’s Outage

Amazon was “out” for many hours last Friday. So were several of its high-profile customers, including Netflix and Instagram.

Isn’t this exactly the kind of breakdown from which The Cloud is supposed to protect us? Is there any way to look at this as anything other than evidence that “off-premises on-demand computing” is fundamentally broken?

“Sort of” and “yes” are the correct answers to those questions. Understanding why requires a bit of background.

Combination of events

While the outage seemed to some of those affected to go on for days, essentially all the Amazon-related technical systems were back to normal in under twelve hours. Netflix had all customers back in service in under half that.

Why do we talk about an “Amazon outage” when most people experienced it as a failure at Heroku or Pinterest or another site, rather than Amazon? The Amazon mail-order catalogue, where customers shop for ski poles or books or pressure cookers, was unaffected. However, Amazon has an electronic service, called Amazon Web Services (AWS), that effectively rents computing power and storage. When you as a consumer are watching a movie streamed from Netflix, you’re actually connected to a machine owned by Amazon and effectively leased by Netflix. The business theory is that the Netflix employees concentrate on customer service, striking deals with Hollywood, and other movie-related concerns, while the specialists at AWS keep all the wires plugged in properly, arrange for deliveries of fuel for the back-up generators, and otherwise take care of the boring parts of working with computers.

Friday night, 29 June 2012, an unusually violent thunderstorm complex with hurricane-force winds swept past the northern Virginia datacenter where Amazon houses some of its computers. The area lost (public) power. This combined with a cascade of other problems and errors: backup generators that didn’t switch over as designed, a bottleneck in rebooting affected machines, and a defect in the way AWS’s Elastic Load Balancer (ELB) routed traffic on overloaded networks. A part of AWS “stuck” in a particularly bad way: it was neither off nor fully functional, but somewhere in between, which further confused attempts at Netflix and other sites to restore their functionality.

At one level, then, the episode was a classic “complex system failure”: a system with lots of protections and “armoring” encountered a combination of individual events (unusual weather, multiple obscure design errors, and so on) and catastrophe resulted.

Colleagues and commenters have concluded that The Cloud is deceptive or ill-advised or somewhere in between. There’s more to it, though.

Amazon is not the only Cloud provider

At least a few promoters encouraged decision-makers to believe that “if it’s in the cloud, it’s automatically safe”, in the words of one acquaintance. That’s cheap, it’s wrong, and it’s time to move past it.

On the other hand, it’s no wiser to jump all the way to the other side and conclude, “If it’s critical to your business, do it yourself.” The local McDonald’s doesn’t disconnect from “the grid” and run its own generators; Starbucks doesn’t insist on owning its own coffee plantations. Similarly, it is perfectly legitimate for an organization to recognize that information technology (IT) plays a crucial role, yet simultaneously outsource at least some of its IT elements.

A sensible conclusion is more like “If it’s critical to your business, analyze for yourself how to achieve the reliability you need.” AWS isn’t designed for the ultimate in reliability, but competing Cloud providers and architectures are.

I have friends who are everything from bitter to astonished that AWS failed, if only for a limited time at only one of its many datacenters. Why didn’t AWS mirror all resources in a different facility and auto-failover? How could AWS not have backups for its backups?

By design: Amazon as a company strategically and deliberately lowers prices. The mission of AWS is to supply “commodity computing” at the lowest possible price. AWS provides tools and mechanisms to enhance reliability, but guarantees are not primarily what it’s selling for now.

In the future, technology or the marketplace might favor new offerings from Amazon, ones with the automatic mirroring and failover that many observers expect from the Cloud. For now, though, there’s enormous growth at the low end of the Cloud, where many customers care far more about price than they do about rare outages. For many customers, Amazon came through like a champion: it was out only a few hours on a weekend, while plenty of other businesses shut doors for days.

In any case, other parts of The Cloud are managed differently. Rackspace and a few other vendors emphasize high availability. These high-end providers juggle different engineering challenges (example: put datacenters too close together, and one weather system can wipe out multiple sites; spread them out too much, and the network won’t be able to update fast enough) so customers don’t have to–at the cost of per-computation charges that are a multiple of Amazon’s.

As I wrote at the beginning of the week, the best starting point for your specific organization is to describe your own requirements clearly. This will equip you to judge whether and where in The Cloud your own computing best fits. You’re likely to find that Amazon or one of its competitors fits your true needs far less expensively than trying to do it all on your own.

One Comment

  1. Jo says:

    No loss for me, a website owner. Except the loss of hundreds of bots that originate from Amazon EC2 servers. What garbage traffic!


