Cloudy promises

Amazon S3 was down for 8 hours today. (So was SQS, but nobody seemed to care. I guess there aren’t a lot of loud, public-facing SQS users.)

This should make a lot of companies rethink their reliance on Amazon Web Services, or on any critical piece of their architecture that they can’t control. Some things to think about:

  • What if your entire site depended on SimpleDB, and it had an unscheduled 8-hour outage a few times per year?
  • This outage ran through a Sunday afternoon. It didn’t matter as much for most customers because most U.S.-targeted sites have low traffic on weekends. But what if it had been down for the 8 hours that cover an entire Monday EST workday? (It only missed by about 20 hours.)
  • What if Google App Engine goes down and your app is hosted there?
  • If you had to completely dump and replace your reliance on any single infrastructure provider, how long would you be down? How much code would need to be designed, written, and tested? What infrastructure changes would you require, and how quickly could you get what you needed?
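To put the first question in numbers, here is a rough sketch of what a handful of 8-hour outages does to yearly uptime. Reading "a few" as three outages is my assumption, and the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(outages_per_year, hours_per_outage):
    """Return the fraction of the year the service is up."""
    downtime_hours = outages_per_year * hours_per_outage
    return 1 - downtime_hours / HOURS_PER_YEAR


# Assumption: "a few" means somewhere between one and three outages a year.
for outages in (1, 3):
    uptime = availability(outages, 8)
    print(f"{outages} outage(s) of 8 hours: {uptime:.3%} uptime")
```

Three such outages add up to a full day of downtime per year, or roughly 99.7 percent uptime, which sounds fine until that day lands on your busiest one.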

That last one’s interesting because there isn’t a drop-in replacement for many of these services. It’s not just a matter of pointing your app at a different set of servers — you can’t, for instance, run a compatible backup instance of S3 or EC2 on your own infrastructure.
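One way to shrink the "how much code" part of that question is to keep every provider-specific call behind a small interface of your own. The following is only a sketch under that assumption: none of these class names come from any real service, and FlakyStore merely simulates an outage instead of calling an actual API.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BlobStore(ABC):
    """Hypothetical minimal interface the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalDiskStore(BlobStore):
    """Backend on infrastructure you control (flat keys only, no slashes)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key, data):
        (self.root / key).write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()


class FlakyStore(BlobStore):
    """Stand-in for a remote provider in the middle of an outage."""

    def put(self, key, data):
        raise ConnectionError("provider unavailable")

    def get(self, key):
        raise ConnectionError("provider unavailable")


class FailoverStore(BlobStore):
    """Write to every backend that accepts it; read from the first that answers."""

    def __init__(self, *backends):
        self.backends = backends

    def put(self, key, data):
        successes = 0
        for backend in self.backends:
            try:
                backend.put(key, data)
                successes += 1
            except OSError:  # ConnectionError is a subclass of OSError
                continue
        if successes == 0:
            raise ConnectionError("all backends failed")

    def get(self, key):
        for backend in self.backends:
            try:
                return backend.get(key)
            except OSError:  # also covers a missing file on local disk
                continue
        raise ConnectionError("all backends failed")


store = FailoverStore(FlakyStore(), LocalDiskStore(tempfile.mkdtemp()))
store.put("avatar.png", b"hello")
print(store.get("avatar.png"))
```

This doesn't make anything a drop-in replacement. It only confines the rewrite to one class per provider, and you still have to build, test, and pay for the second backend before the outage, not during it.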

The cloud

Part of the promise of “cloud” services is that you don’t have to worry about their infrastructure. In theory, these services should tolerate individual failures transparently — you should never have to worry about one of S3’s servers dying and losing your files.

But this isn’t the first time that AWS has had major, service-wide downtime. Individual hard drives and servers can indeed fail without us noticing, but when a system-wide problem occurs, such as a software bug, a cascading overload, or a natural disaster hitting a datacenter, the entire service is affected. Instead of one server failing, all servers are taken down and all customers are affected.

The promise of the cloud is flawed. It’s a leaky abstraction. We’re sold on the idea of a bulletproof, hands-off service that we’ll never need to think about and that abstracts away the petty vulnerabilities of individual servers. But in reality, the cloud itself is still running on a bunch of individual servers, and it’s still built and operated by a bunch of fallible humans.