Walking Into Big Data
A long walk through the English countryside and the current flap over the government surveillance of cell phone records touched off my deeply held and unreasoned Luddite reaction to "big data." Like most over-hyped trends, the surge of interest in big data and its application provokes ennui among those of us with some mileage on our sneakers. Gary King of Harvard says that, with all the available "big data," students in their freshman year can be given a personalized plan to achieve their lifetime career goals. Harvard Business Review claims that data science is the sexiest new profession. Every day brings us more media hyperbole about the application of big data to commercial, political, and scientific enterprises. While some skeptics have surfaced, the mainstream press continues its love affair with big data.
The long walk I recently took through the English countryside (200 miles in two weeks) reminded me of the value of limited information and gave me unencumbered space to think about my oddly blinkered view of big data. Collecting and analyzing data is, after all, how I have made a living for 30 years. Data remain to me the only icon of science left largely unsullied by politics, ego, and money. Perhaps I am just jealous, since HBR suggested that the old guard of statisticians, survey methodologists, and data analysts are not equipped to join the brave new world of big data.
What convinced me otherwise was the way my husband and I recently managed to mostly not get lost on the famous yet poorly marked Coast to Coast Walk through the English Lake District and Yorkshire Moors. We used a $1.50 plastic compass, Ordnance Survey maps, a highly schematic guidebook, and each other. No GPS, no Google Maps, no iPad or iPhone, no turn-by-turn directions. The simple tools of "compass, map, and thou" are based on substantial abstractions of geographic reality subject to errors of judgment and interpretation. More detailed information would have overwhelmed us as we walked while trying to avoid deep bogs, animal excrement, and slippery precipices in the fog and rain. Decisions made with paper maps, trust, and a little visual triangulation kept us true to our course 90 percent of the time.
And so to big data… The history of science is actually one of reverse engineering. In the beginning, our measurement tools for the physical and social world were so crude that substantial abstraction and painstaking taxonomic description were the only choices. The grand theories of natural selection and relativity emerged at a time when the data were very sparse and poorly collected. To have any reasoned explanation of the world, scientists of earlier eras had to accept that the empirical world they could observe was quite limited and distorted. Improvements in our tools have allowed us over time to anchor and refine those grand abstractions with a reality closer to what is observed. Still, the world comes to us through a glass, darkly. Until very recently, we have continued to use substantial abstraction to see and understand natural and social phenomena.
The problem with big data is that it is like trying to take a sip of water from a fire hose. "Big" data is really a euphemism for all of the data thrown off by the digital engines that drive our economic and social transactions: electronic medical records, arrest and conviction records, loyalty card data from the grocery store, all of the stuff you tell OkCupid and Match.com, Google search histories, insurance claims, cell phone calls, and even the digital things we create, like tweets and blog posts.
Any transaction, business process, or social engagement that uses a machine that records, counts and stores stuff in a digital format generates data. Now people and institutions leave digital footprints everywhere. We used to have to ask questions or collect paper records. Now, it is like slapping a universal bar code on the back of every person and business in the world. Every time they do something, the big barcode scanner in the sky records it and stores it. Data are no longer representing reality but rather are the reality.
The problem, of course, is that we have almost come full circle. Rather than too little data, poorly measured, we now have too much data, precisely measured. Our ability to use data effectively to make decisions or understand the world depends on our ability to see patterns and abstract from those patterns. Big data is, in many ways, an exact replica of reality. Using big data to make decisions is like using every square inch of soil, landscape, and sky in my 200-mile walk across England to figure out how to get around the corner in the next small village. It feels to me as if we need to return to the time of Linnaeus, the famous Swedish botanist whose pioneering classification of the natural world gave us the concept of the "species," to classify the intersecting and complexly nuanced world thrown off by our digital engines before we start making decisions using this unknown commodity. We need to rebuild those high-level abstractions from the ground up to make sense of this new reality.
My difficulty with at least the political and commercial applications of big data is that our tools of abstraction and decision-making are decidedly underdeveloped when faced with this type of data. As long as Netflix doesn’t understand that I share my account with my early-20-something daughters, its big data application will continue to recommend "Buffy the Vampire Slayer" and "Gossip Girl" to me when my real preferences run to "Masterpiece Theatre" and subtitled films. On a more serious note, our real fear of the use of cell phone transaction data to understand the social networks of individuals is not necessarily about the invasion of privacy but about the possibility that the wrong person will be identified as a threat because his or her data are taken out of context. The question is no longer whether our data are adequate to support our theories but rather whether we have developed adequate theories to explain our highly nuanced data.
Or maybe I am just jealous that Google hasn’t come looking for me… yet.
Felicia B. LeClere is a senior fellow with NORC at the University of Chicago, where she works as research coordinator on multiple projects. She has 20 years of experience in survey design and practice, with particular interest in data dissemination and the support of scientific research through the development of scientific infrastructure.