3 Lessons From the Big Amazon Web Services Outage

Cloud will go down, have a communication plan, don't overreact, Josh Kim writes.

March 15, 2017

The Feb. 28 Amazon Web Services (AWS) S3 web-based storage outage is likely to make the top 10 list of higher ed-tech stories of 2017.

What did you lose on your campus that day?  

I’ve tried to construct a list of higher ed services that were affected by this five-hour AWS outage. By Googling “AWS outage site:edu,” I’ve come up with this partial list of online services that were down or degraded across higher ed on Feb. 28: Blackboard, Canvas, Echo360, Qualtrics, Zoom, Lynda, Tegrity, Kaltura, Respondus, Turning Technologies, BlueJeans, JSTOR, Elsevier ScienceDirect, ProQuest, VoiceThread and Piazza.

In light of the AWS outage, what questions should provosts and presidents be asking their CIOs about moving to the cloud? And what message should CIOs be giving to their campus leadership colleagues?

I answer those questions with these three lessons from the AWS outage:

Lesson 1 -- Cloud Services Will Go Down

Working in higher ed technology, you learn that eventually all technologies will fail. Anything that can break will eventually break. Cloud computing is no different. You might not be able to plan for unplanned and unexpected downtime -- but you can expect that it will happen.

Amazon’s Service Level Agreement (SLA) for S3 object storage promises 99.9 percent availability; when availability falls below that threshold, Amazon provides service credits. Availability of 99.9 percent equates to 8.76 hours of unplanned downtime per year. At 99.99 percent availability, service will be down for 52.6 minutes a year. Even a 99.999 percent uptime record will result in 5.26 minutes of downtime a year.
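For readers who want to check these figures or try other availability targets, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not anything taken from Amazon's SLA: the function name annual_downtime_hours is made up for this example, and it assumes a 365-day year with downtime spread over the whole year.

```python
# Illustrative only: convert an availability percentage into allowed downtime.
# Assumes a 365-day year (8,760 hours); the function name is hypothetical.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a service can be down at this availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct} percent -> {hours:.2f} hours ({hours * 60:.2f} minutes) a year")
```

Each extra nine cuts the allowed downtime by a factor of 10, which is why the numbers fall from hours to minutes so quickly.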

The bottom line is that any complicated technology -- and cloud computing is a very complicated, interrelated and interconnected series of technologies -- will have some downtime. Once we recognize and accept that outages and issues will occur, we can prioritize the work of planning for these eventualities.

Lesson 2 -- Have a Communications Plan 

Every academic IT unit should have plans for what to do when critical cloud services go down. This means having a communications plan in place. It should be very clear what the communications will be, who will send them, where the communications will appear and at what intervals. Students and faculty are forgiving of technical service interruptions, even interruptions of critical platforms, if detailed information is shared in a timely manner. 

A critical point: The platforms used for communicating service interruptions should not be dependent on a single cloud provider. System status pages and alerts should be separate from any of the infrastructure used to provide campus services. (That will take some work to figure out.) 

One of the ironies of the Feb. 28 AWS outage was that the Service Health Dashboard (SHD) that Amazon relies on to communicate service issues was down because it depended on the S3 region, which was having problems. So be sure that resiliency and redundancy are built into the platforms that will be used to communicate downtime and service interruptions.

Lesson 3 -- Don’t Overreact

The final lesson of the AWS S3 outage is don’t overreact and don’t panic. Knowing that it is normal for the cloud to break down does not mean that higher ed shouldn’t use the cloud.  

The reality is that downtime from cloud providers such as Amazon, Google or Microsoft is likely to decrease as lessons from the Feb. 28 outage are absorbed and acted upon. To Amazon’s credit, the company was transparent and forthcoming about the causes of the problem and how it plans to mitigate the risk of future outages.

We need to remember that while our cloud services can and will go down, our locally hosted applications in our local data centers will also experience “normal” failures. Unplanned and unanticipated downtime is an expected part of any technology service, no matter if that technology lives on or off campus.

The Feb. 28 AWS S3 outage teaches us that technology failures are normal, and that it is up to us to be ready for them.

What do you think the lessons of the five-hour AWS S3 outage on Feb. 28 are for your institution?

