Update: It turns out that my website was not affected by the outage; only the Yottaa monitoring service was. I was not in the one availability zone suffering from an outage. So, much of what I wrote below was incorrect. Also, Amazon has since posted a detailed postmortem which is a great bedtime read.
For a few months now I’ve been cheering on Amazon as I’ve dived further into its cloud services. It has had excellent performance and little downtime, and it is extremely affordable. But on the road to work today, I was listening to my Marketplace podcast, and there was a discussion of the reliability of cloud services, specifically pointing to Amazon’s recent massive outage, which hit major sites like Reddit and FourSquare. Wait, what? There was an outage?
I have my cloud services set up to be monitored by external websites to alert me to exactly these kinds of problems. I double-checked my inbox… nope, no alerts. Nothing in the RSS feeds either. Hmm… I logged into mon.itor.us, but after battling with the pretty awful user interface for half an hour, I found out that history was capped at 24 hours. The event occurred 4 days ago. Great.
Then I checked Yottaa, and it showed me this:
Well that’s no good!
My website was offline between 04/21/2011 03:00 and 04/22/2011 15:00: about 36 hours. Oddly, it didn’t even register as an outage in Yottaa’s dashboard, nor did I get any alerts (apparently that’s not even an option)… well, Yottaa’s got a beta label on their front page, so maybe it will get better. Ironically, the CSS wasn’t loading when I was using Yottaa’s site today. I seem to recall that Yottaa is hosted on Amazon’s servers…
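A gap like that is easy enough to pick out of raw check history if a dashboard won't flag it for you. Here's a rough sketch, assuming you can export your monitor's results as a time-sorted list of (timestamp, ok) pairs. The `find_outages` function, the `min_duration` threshold, and the sample data are all made up for illustration; none of this is Yottaa's or mon.itor.us's actual API.

```python
from datetime import datetime, timedelta


def find_outages(samples, min_duration=timedelta(minutes=10)):
    """Return (start, end) pairs for each run of failed checks.

    `samples` is an iterable of (timestamp, ok) pairs sorted by time.
    An outage starts at the first failed check and ends at the next
    successful one. Runs shorter than `min_duration` are ignored.
    """
    outages = []
    down_since = None
    last_ts = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts  # first failed check of a new run
        elif ok and down_since is not None:
            if ts - down_since >= min_duration:
                outages.append((down_since, ts))
            down_since = None
        last_ts = ts
    # A run still open at the end of the history is closed at the last sample.
    if down_since is not None and last_ts - down_since >= min_duration:
        outages.append((down_since, last_ts))
    return outages


if __name__ == "__main__":
    # Hypothetical hourly checks: failures at hours 3, 4 and 5.
    base = datetime(2011, 1, 1)
    history = [(base + timedelta(hours=h), h not in (3, 4, 5)) for h in range(8)]
    for start, end in find_outages(history):
        print(f"{start} -> {end} ({end - start})")
```

The threshold is there so a single flaky check doesn't count as an outage; what you'd set it to depends on how often your monitor polls.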
I have a third monitoring service – Amazon CloudWatch. The last time I had an issue was when the server stopped responding due to a hardware failure. CloudWatch sent me an email that let me know within 10 minutes that the server was down, and I switched the site to a different server within an hour. Alas, no such email came for this incident… the server itself wasn’t suffering from a CPU or network fault, at least from the CloudWatch server’s perspective. In fact, it still isn’t clear what exactly went wrong… Amazon’s Service Health Dashboard has noted that it was a problem with their storage backend, but CloudWatch shows successful disk I/O on my instance throughout the incident.
This outage could have been avoided if I had any redundancy for this website (the outage only badly affected one of the four availability zones located in Virginia, where my stuff happens to be; if I had had another server on Amazon’s cloud in Japan or California or Ireland, I would have been fine). But I don’t, because I’m a cheap bastard and don’t care about the Internet’s feelings.
So despite Amazon’s first major outage, I’m sticking with them. They were sufficiently transparent during the process via their status updates and will hopefully produce an interesting post-mortem. I expect that they’ll learn a lot from their mistakes, and that websites that rely on Amazon will learn to implement some redundancy (though most of the impacted sites are startups that are as frugal as I am, so maybe not).