Meltdown: Why Our Systems Fail and What We Can Do About It by Chris Clearfield and András Tilcsik
Published in March of 2018.
Congratulations. You are a provost. You are charging a newly formed search committee to recommend your institution's next chief information officer (CIO). What book do you give the search committee?
Congratulations. You are a new CIO. You want your entire team to be thinking about resiliency. What book do you give your team?
A good answer to these two questions would be Meltdown: Why Our Systems Fail and What We Can Do About It.
It is hard to read Meltdown, which is not about higher ed IT, and not come to some strong conclusions about higher ed IT. First, big IT failures at individual institutions are not only possible, they are likely. Second, those failures will be worse than we expect. And third, there are some things that every campus IT organization can do to build resiliency.
The authors of Meltdown step through a range of systems, technical, and organizational failures in order to draw some unified conclusions about calamities. Examples range from fatal subway crashes to airline disasters (both now exceedingly rare), to company failures such as Enron.
All cases of meltdowns share a few common traits. Systems that are tightly coupled are more likely to experience systemic failure, as a single failure can bring down the entire operation. Overly complex systems or technologies are also likely to break, as complexity and risk tend to go hand-in-hand.
On the organizational side, companies and institutions that are homogeneous are also likely to be fragile. Clearfield and Tilcsik synthesize the academic research that demonstrates how diversity is key to resilience and sustainability. An organization that lacks the insights from a range of perspectives, and that prioritizes “cultural fit” above diverse backgrounds, is unlikely to anticipate non-standard risks.
If you have either spent time within a campus IT organization or worked closely with one, you know that they display many of the characteristics that lead to meltdown. Our campus technologies are increasingly tightly coupled and highly complex. How much does the business of our colleges and universities depend on the network and the applications that run our financial, educational, and research operations? If the network goes down, or the applications stop working (or if the data is compromised), how much of normal academic work will be impacted?
In the same way, and despite what I would say are well-meaning efforts by the higher ed IT profession, the diversity of campus information technology organizations remains non-optimal. The makeup of higher education IT as a profession does not mirror the population that it serves. This lack of diversity in higher ed IT, which is part of a larger diversity challenge across the technology industries, results in a more fragile information technology ecosystem.
Reading Meltdown should help campus IT organizations become more resilient. Higher education IT leaders must communicate that campus systems will break, and that technological failures, despite everyone’s best efforts, are inevitable. What is necessary is to have a plan for recovery, and to have rehearsed what happens within the organization when the technology stops working. This is about having redundant systems and well-worked-out plans for communications. This is about de-coupling systems where possible, and prioritizing simplicity and robustness over customization.
Meltdown is a good book to read if you enjoy reading about why things break. If you are a natural worrywart (I am), then Meltdown is the book for you.
What book would you suggest to a CIO search committee?
What book would you want your IT organization to read?
What do CIOs read?
What are you reading?