Science is a process, not a product -- often a long one. And key to the process of organizing and creating knowledge is replication, or reproducing even the most credible-seeming results to help confirm their validity or to expose flaws in the work. Too often, however, for a variety of reasons and competing interests -- why test someone else’s results when journals favor original research, for example? -- replication becomes the missing step in the scientific process. That leaves the door open to research misconduct or fraud or, worse and much more common, promising data going untested.
While the replication problem is widely acknowledged, it is largely unexplored. So a new landmark study suggesting that the results of the vast majority of major recent psychology studies can’t be replicated stands out, and poses important questions that stretch beyond psychology to the other sciences: What do poor reproducibility rates mean, and how can scientists and publishers help put a bigger premium on replication?
“Credibility of the claim depends in part on the repeatability of its supporting evidence,” said Brian Nosek, a professor of psychology at the University of Virginia who led the study and is executive director of the Center for Open Science. Despite that, he said, “little is known about the reproducibility of research in general, and there's been growing concern that reproducibility may be lower than expected or desired.”
Hence, “Estimating the reproducibility of psychological science,” in today’s Science. The four-year study involving 270 co-authors replicated 100 social and cognitive studies published in three top psychology journals in 2008. Nosek said he was expecting about a 50 percent reproducibility rate, but the actual results were much lower: while 97 percent of the original studies produced significant results (a p-value of 0.05 or less) for whatever theory was being tested, just 36 percent of replications did. The effect sizes, or magnitudes of the findings, in the replications were also about half those of the original studies.
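One reason significant originals can fail to replicate, even without error or fraud, is low statistical power. A toy simulation can illustrate the point (this is not part of Nosek’s analysis; the effect size, sample sizes, and trial count below are hypothetical): when a real but modest effect is studied with small samples, only a minority of studies reach significance, and a replication of a "successful" study is just as likely to miss.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_study(true_effect, n, rng):
    # Two-group experiment: treatment mean shifted by true_effect
    # (in standard-deviation units, i.e., Cohen's d). Returns the p-value.
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_effect, 1.0, n)
    t, p = stats.ttest_ind(treatment, control)
    return p

# Hypothetical scenario: a modest true effect, 30 subjects per group.
true_effect, n, trials = 0.3, 30, 10_000
original_p = np.array([run_study(true_effect, n, rng) for _ in range(trials)])

# Keep only originals that were "significant" (p < .05), then replicate each once.
significant = original_p < 0.05
replication_p = np.array([run_study(true_effect, n, rng)
                          for _ in range(significant.sum())])

print(f"Share of originals significant: {significant.mean():.2f}")
print(f"Share of replications also significant: {(replication_p < 0.05).mean():.2f}")
```

Because the replication draws a fresh sample, its chance of significance equals the study's power, not the original's apparent strength -- so conditioning on a significant original does nothing to raise the replication rate.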
Nosek said there were three possible reasons for his results: that the original effect could have been a false positive, that the replication produced a false negative, or that both the original and replication results are accurate but the two experiments’ methodologies differed in significant ways.
Correlational tests suggest, however, that replication success was better predicted by the strength of the original evidence than by characteristics of the original and replication teams. Plus, Nosek’s team tried to minimize error by obtaining all original materials from the study authors, getting feedback from them about the replication designs and making all protocols and data public.
Different results also may have been observed because the phenomenon being studied isn’t yet well enough understood to anticipate sampling differences.
E. J. Masicampo, an assistant professor of psychology at Wake Forest University, both replicated a study and had his own study replicated. Masicampo successfully reproduced results of a paper suggesting that people prioritize instrumental or useful emotions when confronted with a particular task, as opposed to prioritizing pleasant emotions all the time. The set-up tested whether people preferred listening to angry music or recalling angry memories in anticipation of playing a violent video game, for example.
His original study of how people make effortful decisions, however, was not successfully replicated. Masicampo found in his own 2008 study of Florida State University undergraduates (while he was a graduate student there) that a sugary beverage provided a boost of energy that helped those who were previously mentally fatigued to avoid taking unhelpful mental shortcuts in deciding between hypothetical apartments that traded off on space and distance. But for reasons that aren’t entirely clear, the methodology didn’t “translate” at the replication site, the University of Virginia, he said.
“We were trying to be really faithful to the original study in using the original materials, but it became immediately clear upon looking at the results that this simply wasn't an effortful decision for participants,” Masicampo said. “So it really highlighted the issue of how exact these replications should be and how much things change going from one place to another.”
Of course, sometimes irreproducibility comes from possible research fraud. One of the more notorious cases in recent memory is that of Diederik Stapel, the former head of the social psychology department at Tilburg University in the Netherlands, who was accused of making up science he wanted the world to believe -- for example, that litter and trash in a public area made people more likely to think racist thoughts. Marc Hauser, an evolutionary psychologist at Harvard University, left academe in 2011 amid charges of scientific misconduct. And Daryl Bem, a professor emeritus at Cornell University, blew the roof off social psychology in 2010 when he claimed that humans had some psychic powers -- something colleagues immediately denounced as untestable. (In the midst of all this, Daniel Kahneman, a Nobel Prize-winning psychologist at Princeton University, wrote an email to his colleagues studying social priming, warning of the looming “train wreck” that could only be avoided by more replication.)
Nosek acknowledged various controversies in the field but said the replication project was motivated more by genuine curiosity than anything else.
“We don't know the reproducibility of our work, so let's start to investigate,” he said. “Let's do a project to get some information and then use that to help stimulate improvements in the field.”
Nosek said there are many contributing factors to the reproducibility problem, but a major one is “incentives.”
“Publication is the currency of science,” he said. “To succeed, my collaborators and I need to publish regularly and in the most prestigious journals possible. However, not everything we do gets published.” While novel, positive and tidy results are more likely to survive peer review, he said, “this can lead to publication biases that leave out negative results and studies that do not fit the story that we have.”
One major implication of the study is that the sciences as a whole need to value and support replication, he said -- including journal publishers and peer reviewers.
Marcia McNutt, editor in chief of the Science family of journals, said she was especially struck by the finding that strongly significant effects in the original studies were more likely to be replicated -- suggesting that “authors and journal editors should be wary of publishing marginally significant results, as those are the ones that are less likely to reproduce.”
In terms of transparency and “trustworthiness,” McNutt said Science and other journals are working with researchers to raise standards. For example, she said, Science in June published Transparency and Openness Promotion (TOP) guidelines regarding data availability and more. Some 500 journals already are signatories to TOP, she said.
Issues Throughout the Social Sciences
Replication isn’t just problematic in psychology, and researchers who have looked at the issue through other disciplinary lenses applauded Nosek’s project.
Matthew Makel, a gifted-education research specialist at Duke University, co-authored a 2014 paper saying that only 0.13 percent of education articles published in the field’s top 100 journals are replications (versus 1.07 percent in psychology, according to a 2012 study). He said that Nosek’s project was more proof that psychology -- despite its flaws -- was “years ahead” of education in terms of recognizing and remedying a problem shared by the social science research community.
“Replication results may not grab headlines, but they help us understand what results stand the test of time,” he said. “And that is what educators, policy makers, parents and students actually care about.”
Recent research controversies in sociology also have brought replication concerns to the fore. Andrew Gelman, a professor of statistics and political science at Columbia University, for example, recently published a paper about the difficulty of pointing out possible statistical errors in a study published in the American Sociological Review. A field experiment at Stanford University suggested that only 15 of 53 authors contacted were able or willing to provide a replication package for their research. And the recent controversy over the star sociologist Alice Goffman, now an assistant professor at the University of Wisconsin at Madison, regarding the validity of her research studying youths in inner-city Philadelphia lingers -- in part because she said she destroyed some of her research to protect her subjects.
Philip Cohen, a professor of sociology at the University of Maryland, recently wrote a personal blog post similar to Gelman’s, saying how hard it is to publish articles that question other research. (Cohen was trying to respond to Goffman’s work in the American Sociological Review.)
“Goffman included a survey with her ethnographic study, which in theory could have been replicable,” Cohen said via email. “If we could compare her research site to other populations by using her survey data, we could have learned something more about how common the problems and situations she discussed actually are. That would help evaluate the veracity of her research. But the survey was not reported in such a way as to permit a meaningful interpretation or replication. As a result, her research has much less reach or generalizability, because we don't know how unique her experience was.”
Ethnographic research, such as Goffman’s, is particularly hard to replicate, making its usefulness in regard to other settings a “perennial debate” in sociology, Cohen said.
Fabio Rojas, an associate professor of sociology at Indiana University, recently wrote about replication on the popular blog orgtheory.net. He said that sociology can “do better,” and suggested that dissertation advisers insist on data and code storage for students. He said journals and presses should require quantitative papers to have replication packages, and institutional review boards should allow authors to make public some version of their data.
Rojas said he actually didn’t think the psychology replication success rate was that low, considering that experiments involving human subjects are much messier than, say, a high school chemistry lab. For the most part, he said, “personal blame” isn’t at play when studies don’t replicate.
Cristobal Young, an assistant professor of sociology at Stanford who co-wrote the field study on replication packages, said that some of the articles published in top journals present compelling scientific evidence, while “others are just distracting noise that have no contribution to the advancement of knowledge.”
“The troubling thing is,” he continued, “it is very difficult to know which kind you are reading. As readers of scientific work, all we can do is be more skeptical of everything that is published. As social scientists, clearly we need to do more to demonstrate the validity and strength of our findings.”
Rather than be regarded as an insult, as it sometimes is, Young said, replication should be seen as the “natural next step for new empirical findings.”
Over all, Nosek said, his paper means scientists “should be less confident about many of the original experimental results that were provided as empirical evidence in support of those theories.”
More generally, he added, the paper says that science doesn't “always follow a simple straight line path from theory to experiment into understanding, and instead, there is a continual questioning and assessment of theories and of experiments -- and all of that is essential as we move towards understanding.”