Embracing 'Messy' Science

The American Statistical Association seeks to embrace science's inherent complexity and push for more data transparency by rejecting a common, oversimplified measure of statistical significance.

March 15, 2016

Is the tyrannical reign of the P value finally ending (if it was ever tyrannical at all)? An unprecedented statement from the American Statistical Association seeks to usher in a “post-P<0.05 era” and encourage stronger science, and many -- but not all -- scientists approve.

“The statement is about as well balanced and ecumenical as one could have expected,” said Jeff Rouder, Middlebush Professor of Psychological Sciences at the University of Missouri at Columbia, who before reading the statement worried about social scientists abdicating methodological concerns to statisticians. “It focuses rightly on what can’t be done with P values while noting that it does indeed measure the incompatibility of data with a specified hypothesis.”

Rouder added, “They did a good job of not telling us what to do, but of providing caution.”

A primer on P values is probably already overdue. The association's new statement defines a P value as “the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.” P values are often used in experiments to suggest evidence regarding the null hypothesis, or the assumption being tested, with smaller values indicating evidence against the hypothesis.
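To make the definition concrete, here is a minimal sketch of one way to compute a P value for the association's own example, the sample mean difference between two compared groups. It assumes a permutation model (under the null hypothesis, group labels are interchangeable) and uses made-up numbers; the function name and data are illustrative and do not come from the statement.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P value for a difference in sample means.

    Assumed model: if the null hypothesis holds, every way of splitting the
    pooled observations into groups of these sizes is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    pooled_sum = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    as_extreme = 0
    splits = 0
    # Enumerate every possible relabeling of the pooled data.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (pooled_sum - sum_a) / n_b)
        splits += 1
        # Count splits whose difference is equal to or more extreme than
        # the observed one (small tolerance for floating-point rounding).
        if diff >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / splits


# Hypothetical scores for two small groups.
p = permutation_p_value([1, 2, 3], [4, 5, 6])
print(p)
```

With these toy numbers, only two of the 20 possible splits produce a difference as large as the observed one, so the P value is 0.1. Full enumeration is only practical for small samples; larger studies would typically sample random relabelings instead.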

A recent, widely cited study of the objectivity of student evaluations of teaching, for example, suggested that male college students in one sample gave higher scores to male instructors, especially in history and microeconomics (with P values of 0.01, respectively). The researchers were testing the assumption that male students would rate male and female instructors of the same caliber equally, but the low P values hinted that the assumption didn’t hold -- and that gender bias was at play (the alternative hypothesis).

There’s nothing inherently wrong with P values, and they can be helpful in investigating a scientific phenomenon. But scientists have been arguing for years that too often, P values stand in for definitive statistical significance when they’re lower than 0.05 (or P<0.05, hence the statistical association’s hope for a “new era”). In fact, P values are never definitive, and many social scientists say that the 0.05 threshold is arbitrary. And yet, they say, it’s become gospel in terms of getting one’s research published, in that journals tend to reject even scientifically significant findings that for whatever reason don’t meet the widely accepted threshold for statistical significance.

As Andrew Gelman, professor of statistics and political science at Columbia University, put it on his blog (and to the statistical association), “This is what is expected -- demanded -- of subject-matter journals. Just try publishing a result with p = 0.20. If researchers have been trained with the expectation that they will get statistical significance if they work hard and play by the rules, if granting agencies demand power analyses in which researchers must claim 80% [sic] certainty that they will attain statistical significance, and if that threshold is required for publication, it is no surprise that researchers will routinely satisfy this criterion, and publish, and publish, and publish, even in the absence of any real effects, or in the context of effects that are so variable as to be undetectable in the studies that are being conducted.”

Such concerns led the editors of Basic and Applied Social Psychology to ban papers containing P values last year. “We believe that the P<0.05 bar is too easy to pass and sometimes serves as an excuse for lower-quality research,” its editors wrote, explaining the decision.

The move prompted much debate among social scientists, and within the statistical association. Soon the governing board tasked Ron Wasserstein, its executive director, with assembling a diverse group to weigh in on the issue of P values and statistical significance. The group communicated for months about whether an agreement could be reached, what format the statement might take, and who the target audience would be. Major points of contention were how to characterize alternatives to the P value, and the assertion that a P value near 0.05 offered only weak evidence against a null hypothesis. Twenty members of the group met in October to name principles upon which any statement could be built. Drafts ensued, and the association’s Executive Committee approved a statement earlier this year.

It was released to the public last week, with an accompanying document from Wasserstein and his co-author, Nicole A. Lazar, professor of statistics at the University of Georgia, saying, “Let’s be clear. Nothing in the … statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail. We hoped that a statement from the world’s largest professional association of statisticians would open a fresh discussion and draw renewed and vigorous attention to change the practice of science with regards to the use of statistical inference.”

While not revolutionary, the statement still signals a shift in how statisticians and social scientists generally are thinking about statistical significance, especially in a time of vast computational power and concerns about data integrity.

“Increased quantification of scientific research and a proliferation of large, complex data sets in recent years have expanded the scope of applications of statistical methods,” the statement reads. “This has created new avenues for scientific progress, but it also brings concerns about conclusions drawn from research data. The validity of scientific conclusions, including their reproducibility, depends on more than the statistical methods themselves. Appropriately chosen techniques, properly conducted analyses and correct interpretation of statistical results also play a key role in ensuring that conclusions are sound and that uncertainty surrounding them is represented properly.”

In addition to defining P values, the statement offers the following guiding principles for using them and determining statistical significance:

  1. P values can indicate how incompatible the data are with a specified statistical model.
  2. P values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a P value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A P value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a P value does not provide a good measure of evidence regarding a model or hypothesis.
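The fifth principle can be illustrated with a short calculation. The sketch below assumes a simple two-sample z-test with a known, shared standard deviation and equal group sizes; the function name and all numbers are hypothetical. It holds a very small difference between groups fixed and changes only the sample size.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided P value for a difference in means under an assumed z-test.

    Assumptions (for illustration only): both groups share a known standard
    deviation `sd` and each has `n_per_group` independent observations.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same tiny effect (0.02 standard deviations) at two sample sizes.
tiny_effect = 0.02
p_small_study = two_sided_z_p_value(tiny_effect, sd=1.0, n_per_group=100)
p_huge_study = two_sided_z_p_value(tiny_effect, sd=1.0, n_per_group=100_000)

print(f"n = 100 per group:     P = {p_small_study:.3g}")
print(f"n = 100,000 per group: P = {p_huge_study:.3g}")
```

The effect is identical in both runs, yet the small study lands far above 0.05 and the very large one far below it. The P value here reflects how much data was collected as much as how big the effect is, which is why it cannot stand in for effect size or importance.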

Wasserstein in an interview said the association is not seeking to ban P values in research, or proposing that they be replaced with an equally problematic single alternative. Instead, he said, the association advocates transparency regarding methods and findings, to produce a fuller scientific record. He said he doesn't believe that what some have called a reproducibility “crisis” in the social sciences sprang from a purposeful manipulation of data, but rather an incomplete scientific record attributed in part to overdependence on P values and a preference for easily digestible conclusions.

Science is simply “messier” than that, Wasserstein said. “Taking everything you learned from conducting an experiment and assigning statistical significance, and thinking you can boil it down to one number and a single conclusion is excessively hopeful and misleading. … A P value is just a single number, which cannot hope to summarize all the information contained in scientific research.”

The statement suggests additional approaches and methods, such as those that emphasize estimation over testing, including confidence, credibility or prediction intervals; Bayesian methods; alternative measures of evidence, as in likelihood intervals; or decision-theoretic modeling and false discovery rates. All such approaches rely on further assumptions but may more directly address the size of an effect or whether the hypothesis is correct, according to the association.
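As a rough illustration of estimation over testing, the sketch below reports a confidence interval for a difference in group means rather than a single P value. It assumes a large-sample normal approximation (a t-based interval would be wider for small samples), and the function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(group_a, group_b, level=0.95):
    """Approximate confidence interval for mean(group_a) - mean(group_b).

    Assumption: samples are independent and large enough for a normal
    approximation to the sampling distribution of the difference.
    """
    diff = mean(group_a) - mean(group_b)
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * standard_error, diff + z * standard_error


# Hypothetical ratings from two groups.
ratings_a = [4.1, 3.8, 4.5, 4.0, 3.9, 4.3, 4.2, 3.7]
ratings_b = [3.6, 3.9, 3.5, 4.0, 3.4, 3.8, 3.7, 3.3]

low, high = mean_difference_ci(ratings_a, ratings_b)
print(f"Estimated difference lies roughly between {low:.2f} and {high:.2f}")
```

An interval like this shows both the direction and the plausible size of an effect, along with the uncertainty around it, rather than only whether a threshold was crossed. As the statement notes, such methods rest on further assumptions of their own.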

“No single index should substitute for scientific reasoning,” the statement concludes.

Other disciplinary associations are concerned with matters of data transparency and reproducibility. The American Political Science Association recently adopted new standards for data access and research transparency, for example, prompting some faculty objections. Psychology has had its share of reproducibility concerns, as well, perhaps best articulated in a 2015 study that found the results of a majority of experiments with a P value of 0.05 or less couldn’t be replicated.

Howard S. Kurtzman, acting executive director for science at the American Psychological Association, said via email that the statistical association “has done all scientists a great service by carefully examining the meaning and role of significance testing and P values in quantitative research. … I encourage scientists to carefully think about how they approach statistical analysis and to consider a broader range of techniques as appropriate for their specific research questions and their data.”

Rouder, at Missouri, said the statement wouldn’t affect him because he already prefers Bayesian evidence measures -- through which he can derive predictions for competing models and assess how accurately each predicted the observed data -- to P values. But he said the document has implications for the field, especially in light of current reproducibility concerns.

Over all, he said, “these are very exciting times to be a methodologist.” Perhaps the “coolest part,” he added, is “that people are finally realizing that analysis is not a matter of following prescribed rules, but requires thought, insight and, in some cases, courage. We are starting to have more accountability and transparency, and that is a win for social science.”

Gelman, who was on the association’s committee but did not write the statement, said via email that the document “reflects a gradual but welcome shift in statistics, moving away from null hypothesis significance testing (which is all about rejecting a model nobody likes anyway) toward direct modeling of effects and their variation.” (In a blog post, he elaborated on his distaste for null-hypothesis testing, calling it “that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B. … Confidence intervals, credible intervals, Bayes factors, cross-validation: you name the method, it can and will be twisted, even if inadvertently, to create the appearance of strong evidence where none exists.”)

Not everyone approves of the association's statement, however. Philip B. Stark, associate dean in the Division of Mathematical and Physical Sciences at the University of California at Berkeley, served on the statement committee and published a dissent alongside the document. (He also co-authored the recent paper on student evaluations of teaching that relied on P values.) Stark said via email that the statistical association's stance “comes dangerously close to throwing out the baby with the bath water. There is no simpler, more broadly applicable statistical tool for preventing self-deception than the P value.”

Stark added, “It's like a knife: if you don't use it properly you may injure yourself or others. The problem with P values is a problem shared by most if not all statistical techniques: it needs to be used intelligently and carefully. One of the biggest pitfalls in using P values is to ignore the process of selecting which hypotheses to test and which inferences to report. Virtually every statistical technique for quantifying uncertainty has the same vulnerability.”

