On Monday, August 20th, Instructure's cloud-based learning management system (LMS), Canvas, experienced "slow loads or timeouts on certain pages".
The fact that our technology will sometimes bite back comes as zero surprise to any of us who spend our lives working at the intersection of technology and learning.
Unplanned downtime is the norm, not the exception, when it comes to the platforms that we increasingly depend upon to create, manage and deliver our courses. We can and will work toward zero downtime, but until that perfect world arrives the real test is how gracefully our systems fail, what steps we take to communicate the problems, and the degree to which we are able to learn from each incident.
For these reasons I think that Instructure's response to the unplanned downtime of 8/20/12 is instructive about the leadership and culture of the company.
On 8/21, Instructure's CEO, Josh Coates, wrote a detailed blog post about the issues. What can we all learn from how Instructure handled these technical problems?
1. Take Responsibility: It is great that Coates, the CEO, wrote this post. He is clearly involved in the day-to-day running of the company, and while he shows strong confidence in his team, he accepts that ultimately the responsibility for any disruption in service rests with him.
2. Don't Minimize the Problem: In the first sentence of his post Coates writes, "Yesterday was our first major “bad day” for Canvas, and it couldn’t have happened on a worse day since many of you were starting Fall classes." Right away it is clear that he understands exactly how bad the timing of the downtime is, and the difficulties that this problem will cause for his customers.
3. Describe the Plan Going Forward: Canvas' problems were caused by the database servers not being able to keep up with the load. Coates describes both a short-term fix to the problem, and the long-term re-architecting of Canvas' infrastructure to avoid the problems in the future.
4. Make Communication Open and Two-Way: The fact that Coates communicated about "Canvas' Bad Day" in a blog with commenting enabled greatly enhanced the effectiveness of Instructure's response to these technical issues. While almost all the comments were very positive, the open forum also allowed at least one customer to vent. In not trying to "control the message" Coates went a long way towards building the community on which the company will rely for its long-term success.
Was Instructure's response to the downtime perfect? If I were a customer I would have preferred more in-depth technical details about why the database failure occurred, and what the engineering team had done to simulate the load before the start of classes.
My overall read on "Canvas' Bad Day", however, is that Instructure's open and public communication about the problems demonstrated a commitment to transparency and integrity.
All of us can learn from the example that Instructure and Josh Coates set for how to behave when things go wrong.