Article on difficulties in social-media research

The Archive Is Closed

Social-media researchers have more than enough material for their research -- and that's getting to be a problem. Scott McLemee makes some inquiries.

Five years ago, this column looked into the scholarly potential of the Twitter archive the Library of Congress had recently acquired. That potential was by no means self-evident. The incensed “my tax dollars are being used for this?” comments practically wrote themselves, even without the help of Twitter bots.

For what -- after all -- is the value of a dead tweet? Why would anyone study 140-character messages, for the most part concerning mundane and hyperephemeral topics, with many of them written as if to document the lowest possible levels of functional literacy?

As I wrote at the time, papers by those actually doing the research treated Twitter as one more form of human communication and interaction. The focus was not on the content of any specific message, but on the patterns that emerged when they were analyzed in the aggregate. Gather enough raw data, apply suitable methods, and the results could be interesting. (For more detail, see the original discussion.)

The key thing was to have enough tweets on hand to grind up and analyze. So, yes, an archive. In the meantime, the case for tweet preservation seems easier to make now that elected officials, religious leaders and major media outlets use Twitter. A recent volume called Twitter and Society (Peter Lang, 2014) collects papers on how politics, journalism, the marketplace and (of course) academe itself have absorbed the impact of this high-volume, low-word-count medium.

One of the book’s co-editors is Katrin Weller, who is an information scientist from the GESIS Leibniz Institute for the Social Sciences, in Cologne, Germany. At present she is in the final month of a Kluge Fellowship at the Library of Congress, which seems like an obvious place to conduct her research into the use of Twitter to study historical events. Or it would have been, if the archive of tweets were open to scholars, which it still isn’t, and won’t be any time soon.

Unable to pursue her original project, Weller used the Kluge Fellowship to broaden her focus -- which, she told me in an email exchange, “has been pretty much on working with Twitter data [over] the last years.” She spent her time catching up with the scholarship on other forms of social media and investigating various web-archiving projects at the library.

As for the digital collection that made her want to go to Washington, DC, in the first place… well, the last official statement from the library was issued in January 2013. It reported that Twitter’s output from 2006 to 2010 -- consisting of “approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description” -- had finally been organized, by hour. The process was to be completed that month, even as another half billion or so tweets per day were added to the collection.

The Library of Congress finds itself in the position of someone who has agreed to store the Atlantic Ocean in his basement. The embarrassment is palpable. No report on the status of the archive has been issued in more than two years, and my effort to extract one elicited nothing but a statement of facts that were never in doubt.

“The library continues to collect and preserve tweets,” said Gayle Osterberg, the library’s director of communications, in reply to my inquiry. “It was very important for the library to focus initially on those first two aspects -- collection and preservation. If you don’t get those two right, the question of access is a moot point. So that’s where our efforts were initially focused and we are pleased with where we are in that regard.”

As of early 2013, the library reported it had received more than 400 requests to use the archive. Since then, members of the public have asked for updates on the library’s blog, with no response forthcoming. At this point no date has been set for the archive to be opened to researchers. The leadership of the Library of Congress may be “pleased [by] where we are,” but their delight is not likely to be contagious.

No grumbling from Katrin Weller, though. She sent me a number of her recent and forthcoming papers on what might be called second-order social-media research. That is, they take up the problems and concerns that face scholars trying to study social media.

Apart from the difficulties involved in archiving -- enough on that, for now -- there are methodological and ethical problems galore, as becomes clear from a paper Weller co-authored with her colleague Katharina E. Kinder-Kurlanda, a cultural anthropologist also at the Leibniz Institute. In 2013 and 2014, they conducted 42 interviews with social-media researchers at international conferences. The subjects were from various fields and parts of the world. What they had in common was the use of data gathered from a variety of social-media venues -- not just Twitter and Facebook but “many other platforms such as Foursquare, Tumblr, 4chan and Reddit.”

Elsewhere, Weller has described social-media research as a kaleidoscope containing “thousands of individual pieces, originating from different perspectives and discipline, applying different methods and establishing different assumption about social media” -- with the kaleidoscope constantly shaking from site redesigns, changes in privacy policy and so on.

All of which makes establishing methodological standards -- how material from social media platforms is collected, documented and handled -- extremely difficult, if not impossible. A research team might find it necessary to invent a program to harvest raw data from a site, but if the overall focus of the project is sociological or linguistic, the details will probably not be discussed in the resulting publication. There is also the issue of “data cleaning,” i.e., filtering out messages from spam accounts, bots and the like, in order to create a data set consisting of only human-generated material (as much as that is possible). It is a time- and labor-intensive process, and the thoroughness of the job will in part be a function of the budget.

So the size, quality and reliability of the raw material itself are going to vary widely from researcher to researcher. Weller and Kinder-Kurlanda note the case of the same data being collected from a single social-media website using the same tools, but run in parallel on two different servers. The result was different data sets. And all of this, mind you, before the serious analytical crunching even gets started.

One partial solution, or at least stopgap measure, is to share data sets -- certainly easing the strain on some researchers’ purses. The authors mention finding researchers “who felt an ethical obligation to share their data sets, either with other researchers or with the public.” About a third of the researchers Weller and Kinder-Kurlanda interviewed “had experience in working with data collected by others.” But the practice raises ethical problems about privacy, and it sounds like some of the exchanges take place sub rosa. And in any event, sharing the data sets probably won't change the drift toward some social-media platforms being over- or underresearched because their data are easier to collect or clean.

Weller indicates that she intends to write more about the epistemological issues raised by social media. That sounds like an interesting topic, and a perplexing one. Besides, it will clearly be a long, long time before anyone gets to use Twitter as a tool for historical research.