Amazon S3 was down for 8 hours today. (So was SQS, but nobody seemed to care. I guess there aren’t a lot of loud, public-facing SQS users.)
This should make a lot of companies rethink their reliance on Amazon Web Services, or on any significant architectural dependency they can’t control.
What makes this especially interesting is that there’s no drop-in replacement for many of these services. It’s not just a matter of pointing your app at a different set of servers: you can’t, for instance, run a compatible backup instance of S3 or EC2 on your own infrastructure.
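The best you can do today is hide storage behind your own interface and bolt a fallback onto it yourself. Here’s a minimal sketch of that idea in Python; every name in it is hypothetical, and it deliberately ignores everything that makes this hard in practice (authentication, listing, consistency, and the bandwidth of keeping a second copy of all your data):

```python
import os
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Minimal storage interface an app could code against instead of
    calling S3's API directly. All names here are hypothetical."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class LocalBlobStore(BlobStore):
    """Filesystem-backed store, usable as a crude fallback."""

    def __init__(self, root: str):
        self.root = root

    def _path(self, key: str) -> str:
        # Flatten keys so "photos/cat.jpg" maps to a single file.
        return os.path.join(self.root, key.replace("/", "_"))

    def put(self, key: str, data: bytes) -> None:
        os.makedirs(self.root, exist_ok=True)
        with open(self._path(key), "wb") as f:
            f.write(data)

    def get(self, key: str) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()

class FailoverBlobStore(BlobStore):
    """Writes to both stores; reads from the primary (which would wrap
    S3's API) and falls back to the secondary if the primary is down."""

    def __init__(self, primary: BlobStore, secondary: BlobStore):
        self.primary, self.secondary = primary, secondary

    def put(self, key: str, data: bytes) -> None:
        for store in (self.primary, self.secondary):
            store.put(key, data)

    def get(self, key: str) -> bytes:
        try:
            return self.primary.get(key)
        except Exception:
            return self.secondary.get(key)

if __name__ == "__main__":
    # Demo with two local stores; in real use, the primary would be an
    # S3-backed implementation (not shown).
    store = FailoverBlobStore(LocalBlobStore("/tmp/primary"),
                              LocalBlobStore("/tmp/backup"))
    store.put("photos/cat.jpg", b"\xff\xd8...")
    assert store.get("photos/cat.jpg") == b"\xff\xd8..."
```

And even that only buys you a second copy of your bytes. It gets you none of S3’s durability guarantees, bucket semantics, or economics, which is exactly the point: the service can be approximated, but not replaced.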
Part of the promise of “cloud” services is that you don’t have to worry about their infrastructure. In theory, these services tolerate individual failures transparently: one of S3’s servers can die without you ever noticing, and certainly without you losing any files.
But this isn’t the first time that AWS has had major, service-wide downtime. Individual hard drives and servers do fail without anyone noticing, but a system-wide problem (a software bug, a cascading overload, a natural disaster hitting a datacenter) hits the entire service at once. Instead of one server failing and inconveniencing one customer, the whole platform fails and takes every customer down with it.
The promise of the cloud is flawed. It’s a leaky abstraction. We’re sold on the idea of a bulletproof, hands-off service that we’ll never need to think about and that abstracts away the petty vulnerabilities of individual servers. But in reality, the cloud itself is still running on a bunch of individual servers, and it’s still built and operated by a bunch of fallible humans.