Should we redefine statistical significance?

RSS international conference // Cardiff, UK

Richard D. Morey

5 September 2018

Science and statistics


Statistics: contextual problem solving

  • Research design
  • Summary
  • Visualization
  • Fallacies/“paradoxes”

When all you’ve got is a hammer

“Null hypothesis significance testing”


  • Bastardization of classical significance testing logic
  • Focus on single hypothesis
  • Focus on single “effect”
  • Focus on single criterion for “existence”
  • Statistical rejection mistaken for theoretical support

A replication crisis?

Newly popular: Replication attempts

  • Reproducibility Project: Psychology
  • “Many labs” projects
  • Large-scale replications by skeptics

The hammer strikes again

  • Single effects
  • Primary outcome: “Is replication ‘significant’?”

Many replications “fail” (!)

Redefine statistical significance (RS)?

RS primary arguments

“We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.”


  1. p<.05 represents “weak [Bayesian] evidence”
  2. p<.05 inflates “false discovery rate”
  3. p<.05 results less likely to “replicate” than p<.005

➡ Using p<.05 “leading cause of non-reproducibility”
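
For a concrete version of argument 1, here is a minimal sketch (assuming the Sellke–Bayarri–Berger -e·p·ln(p) bound, one common calibration and not necessarily the one used in the talk) of the largest Bayes factor against a point null that a given p value can support:

```python
import numpy as np

def max_bf10(p):
    """Upper bound on the Bayes factor against a point null
    implied by a p value: 1 / (-e * p * ln p), valid for p < 1/e
    (Sellke, Bayarri & Berger, 2001)."""
    assert 0 < p < 1 / np.e
    return 1.0 / (-np.e * p * np.log(p))

print(round(max_bf10(0.05), 2))   # 2.46: weak evidence at best
print(round(max_bf10(0.005), 2))  # 13.89: much stronger
```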

Counter-arguments

The RS proposal…

  • …encourages contextless, simplistic view of “discovery”
  • …is based on restrictive, unrealistic prior assumptions
  • …exaggerates role of necessarily rare p values near .05

Among others:

  • Argument 2: “False discovery rate” is flawed (Mayo & Morey, in prep.)
  • Argument 3: true for all α2 < α1 (simulation sketch below)
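
A minimal simulation sketch of that last point, assuming a simple normal model with a 50/50 mix of null and moderate true effects (all settings are illustrative, not from the talk): tightening the original threshold raises the replication rate for any pair of thresholds, so the pattern lends no special support to .005 in particular.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, n = 200_000, 30                       # studies; per-study sample size
null = rng.random(N) < 0.5               # half the effects are truly null
delta = np.where(null, 0.0, rng.normal(0.5, 0.2, N))
z1 = rng.normal(np.sqrt(n) * delta, 1)   # original study z statistic
z2 = rng.normal(np.sqrt(n) * delta, 1)   # replication z statistic

def rep_rate(alpha_orig, alpha_rep=0.05):
    """P(replication significant | original significant at alpha_orig)."""
    sig1 = np.abs(z1) > stats.norm.isf(alpha_orig / 2)
    sig2 = np.abs(z2) > stats.norm.isf(alpha_rep / 2)
    return (sig1 & sig2).mean() / sig1.mean()

for a in (0.05, 0.03, 0.005, 0.001):
    print(a, round(rep_rate(a), 2))      # monotone: stricter -> higher
```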

Priors/models

Dividing, p value

[Figure: the null sampling distribution divided at the observed statistic; p = 0.13]
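
A minimal sketch of the division the figure shows, assuming a standard normal test statistic; the observed value z ≈ 1.514 is hypothetical, chosen only to reproduce the slide’s p = 0.13:

```python
from scipy import stats

z_obs = 1.514  # hypothetical observed statistic (assumption)
# The p value "divides" the null sampling distribution at the
# observed statistic: the two tail areas beyond it sum to p.
p = 2 * stats.norm.sf(abs(z_obs))
print(round(p, 2))  # 0.13
```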

Dividing, Bayes factor

BF_np = (0.93 / (1 - 0.93)) / (1 / 1) ≈ 14.06
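
A minimal sketch of the same division in Bayes-factor terms: the BF is the posterior odds of the two regions divided by their prior odds. The even prior odds (1/1) are read off the slide; 0.9336 is an assumed unrounded posterior probability that reproduces the displayed 14.06:

```python
def bayes_factor(post_prob, prior_prob=0.5):
    """Bayes factor for one region over the other: the ratio of
    posterior odds to prior odds."""
    post_odds = post_prob / (1 - post_prob)
    prior_odds = prior_prob / (1 - prior_prob)
    return post_odds / prior_odds

print(round(bayes_factor(0.9336), 2))  # 14.06
```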

Two-sided, point null: BF and p
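
The figure here presumably plots a relationship of this kind; a minimal sketch assuming a z statistic and a Normal(0, τ²) prior on the standardized effect (n and τ below are illustrative, not the talk’s settings):

```python
import numpy as np
from scipy import stats

def bf10_two_sided(z, n, tau=1.0):
    """Point-null Bayes factor for a z statistic.
    H0: delta = 0; H1: delta ~ Normal(0, tau^2), so that
    marginally z ~ Normal(0, sqrt(1 + n * tau**2)) under H1."""
    m1 = stats.norm.pdf(z, scale=np.sqrt(1 + n * tau**2))
    return m1 / stats.norm.pdf(z)

for p in (0.05, 0.02, 0.005):
    z = stats.norm.isf(p / 2)  # statistic just reaching p, two-sided
    print(p, round(bf10_two_sided(z, n=50), 2))
```

Note the n- and τ-dependence: the same p maps to different BFs under different designs, which is the “no general calibration” point at the end of the talk.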

One-sided, point null: Bayes factors

  • Idea: choose half the prior for a one-sided test
  • Bayesian can choose side consistent with data
  • Evidential BF “boost”: approx. 2 - p (for smallish p; sketched below)
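
A minimal sketch of the boost under the same illustrative normal model as above: truncating the prior to the data-consistent side doubles its density there, and the resulting ratio of one-sided to two-sided BF is 2·Pr(δ > 0 | data), close to 2 - p:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bf10(z, n, tau=1.0, one_sided=False):
    """Point-null BF for a z statistic; the H1 prior on the
    standardized effect is Normal(0, tau^2), optionally truncated
    (and renormalized, hence the factor 2) to delta > 0."""
    lo = 0.0 if one_sided else -np.inf
    mult = 2.0 if one_sided else 1.0
    m1, _ = quad(lambda d: stats.norm.pdf(z - np.sqrt(n) * d) *
                 mult * stats.norm.pdf(d, scale=tau), lo, np.inf)
    return m1 / stats.norm.pdf(z)

z = stats.norm.isf(0.025)  # data just reaching p = .05, positive side
boost = bf10(z, n=50, one_sided=True) / bf10(z, n=50)
print(round(boost, 2))  # ~1.95, i.e. about 2 - p
```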

One-sided, point null: BF and p

Best-case Bayesian argument?


  • Point null may be appropriate for some problems
  • No reason for Bayesians to use two-sided hypotheses!
  • So: p<.03 or p<.02 more appropriate calibration

But…are these p values that common without cheating?

How common are these p values?

These p values are rare!
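
A minimal sketch of why these p values must be rare, assuming a 50/50 mix of true nulls (p uniform) and alternatives with about 80% power at α = .05 (illustrative settings; the talk’s figures rest on their own assumptions and data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 1_000_000
null = rng.random(N) < 0.5
z = rng.normal(np.where(null, 0.0, 2.8), 1)  # ncp 2.8 ~ 80% power at .05
p = 2 * stats.norm.sf(np.abs(z))

near_05 = (p > 0.03) & (p < 0.05)  # the contested "marginal" band
print(round(near_05.mean(), 3))    # ~0.04: a small share of all p values
```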

RS’s arguments fail

  • “Discovery” is more than just a significant p value
    • A new criterion is unlikely to help the situation
  • No general calibration of p values to Bayes factors
    • Sometimes p will “appear” to be more lenient…
    • …sometimes BF will “appear” to be more lenient
  • Unlikely that marginal p values have helped cause the crisis
    • Best-case Bayesian argument: weakly evidential p’s are rare

Thank you.