5 September 2018

Science and statistics


Statistics: contextual problem solving

  • Research design
  • Summary
  • Visualization
  • Fallacies/“paradoxes”

When all you’ve got is a hammer

“Null hypothesis significance testing”


  • Bastardization of classical significance testing logic
  • Focus on single hypothesis
  • Focus on single “effect”
  • Focus on single criterion for “existence”
  • Statistical rejection mistaken for theoretical support

A replication crisis?

Newly popular: Replication attempts

  • Reproducibility Project: Psychology
  • “Many labs” projects
  • Large-scale replications by skeptics

The hammer strikes again

  • Single effects
  • Primary outcome: “Is replication ‘significant’?”

Many replications “fail” (!)

Redefine statistical significance (RS)?

RS primary arguments

“We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.”


  1. \(p\approx.05\) represents “weak [Bayesian] evidence”
  2. \(p<.05\) inflates “false discovery rate”
  3. \(p<.05\) results less likely to “replicate” than \(p<.005\)

➡ Using \(p<.05\) “leading cause of non-reproducibility”
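
For reference, the quantity behind argument 2 can be sketched in a few lines. This is the usual screening-style calculation, not necessarily RS's exact setup: the function name, the prior proportion of true hypotheses, and the fixed 80% power are illustrative assumptions, not values from the talk.

```python
def false_discovery_rate(alpha: float, power: float, prior_h1: float) -> float:
    """Share of 'significant' results that are false positives.

    alpha    -- significance threshold
    power    -- probability of a significant result when the effect is real
    prior_h1 -- ASSUMED proportion of tested hypotheses that are true
    """
    false_pos = alpha * (1.0 - prior_h1)   # nulls that reach significance
    true_pos = power * prior_h1            # real effects that reach significance
    return false_pos / (false_pos + true_pos)


# Illustrative inputs only: one true effect per ten nulls, power held at 0.8.
PRIOR_H1 = 1 / 11
for alpha in (0.05, 0.005):
    fdr = false_discovery_rate(alpha, power=0.8, prior_h1=PRIOR_H1)
    print(f"alpha={alpha}: FDR={fdr:.3f}")
```

The result depends entirely on the assumed prior proportion and power, which is where the counter-arguments below apply.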

Counter-arguments

The RS proposal…

  • …encourages contextless, simplistic view of “discovery”
  • …is based on restrictive, unrealistic prior assumptions
  • …exaggerates role of necessarily rare p values near .05

Among others:

  • Argument 2: “False discovery rate” is flawed (Mayo & Morey, in prep.)
  • Argument 3: true for all \(\alpha_2<\alpha_1\)
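
The point about argument 3 can be illustrated (not proved) numerically. This sketch assumes a two-sided z test, a hypothetical 50/50 mix of null effects and effects with noncentrality 2.8, and counts a replication as "successful" when its two-sided p is below .05 regardless of direction. All names and numbers are assumptions for illustration.

```python
from statistics import NormalDist

STD = NormalDist()


def power(alpha: float, delta: float) -> float:
    """P(two-sided p < alpha) for a z test with noncentrality delta."""
    z_crit = STD.inv_cdf(1 - alpha / 2)
    return STD.cdf(delta - z_crit) + STD.cdf(-delta - z_crit)


def replication_rate(alpha_orig, effects, alpha_rep=0.05):
    """P(replication significant | original significant at alpha_orig).

    effects -- list of (delta, weight) pairs; an ASSUMED effect distribution.
    """
    joint = sum(w * power(alpha_orig, d) * power(alpha_rep, d) for d, w in effects)
    marginal = sum(w * power(alpha_orig, d) for d, w in effects)
    return joint / marginal


EFFECTS = [(0.0, 0.5), (2.8, 0.5)]            # hypothetical mixture
THRESHOLDS = [0.10, 0.05, 0.01, 0.005, 0.001]  # decreasing alpha
RATES = [replication_rate(a, EFFECTS) for a in THRESHOLDS]

for a, r in zip(THRESHOLDS, RATES):
    print(f"original alpha={a}: replication rate={r:.3f}")
```

In this setup the replication rate rises at every step down in the threshold, so the ordering does not single out .005.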

Priors/models

Dividing, p value

\(p = 0.13\)

Dividing, Bayes factor

\(BF_{np} = \left.\frac{0.93}{1 -0.93}\middle/\frac{1}{1}\right. = 14.06\)
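
The Bayes factor here is the posterior odds divided by the prior odds. A minimal sketch of that arithmetic follows; the function names are hypothetical. Plugging in the rounded 0.93 gives about 13.3 rather than 14.06, so the slide's value presumably comes from an unrounded posterior probability near 0.934. That is an inference from the arithmetic, not something stated in the source.

```python
def bayes_factor(posterior_prob: float, prior_prob: float) -> float:
    """Bayes factor = posterior odds / prior odds for one hypothesis vs. its complement."""
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


def implied_posterior(bf: float, prior_prob: float) -> float:
    """Posterior probability implied by a Bayes factor and a prior probability."""
    posterior_odds = bf * prior_prob / (1.0 - prior_prob)
    return posterior_odds / (1.0 + posterior_odds)


# Prior odds of 1 (prior probability 0.5), as in the slide's 1/1 term.
print(bayes_factor(0.93, 0.5))        # rounded posterior probability from the slide
print(implied_posterior(14.06, 0.5))  # posterior probability implied by BF = 14.06
```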

Two-sided, point null: BF and p

One-sided, point null: Bayes factors

  • Idea: choose half the prior for a one-sided test
  • Bayesian can choose side consistent with data
  • Evidential BF “boost”: multiplied by approx. \(2 - p\) (for smallish \(p\))
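
A sketch of where the boost comes from, under assumptions not on the slide: the two-sided prior under \(H_1\) is symmetric about the null value, the one-sided hypothesis \(H_+\) uses that prior truncated to \(\theta>0\), and \(p\) is the two-sided p value.

\[
\frac{BF_{+0}}{BF_{10}}
= \frac{p(y\mid H_+)}{p(y\mid H_1)}
= \frac{\Pr(\theta>0\mid y, H_1)}{\Pr(\theta>0\mid H_1)}
= 2\,\Pr(\theta>0\mid y, H_1)
\]

If the data favor the positive side and \(\Pr(\theta>0\mid y, H_1)\approx 1-p/2\) (exact for a flat prior in a normal model, only a rough guide otherwise), the factor is approximately \(2-p\).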

One-sided, point null: BF and p

Best-case Bayesian argument?


  • Point null may be appropriate for some problems
  • No reason for Bayesians to use two-sided hypotheses!
  • So: \(p<.03\) or \(p<.02\) more appropriate calibration

But…are these p values that common without cheating?

How common are these p values?

These p values are rare!
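
A rough check of how often p lands just under .05. This sketch assumes a two-sided z test and takes "these p values" to mean the band between .03 (or .02) and .05, which is one reading of the preceding slides. The noncentrality values are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def power(alpha: float, delta: float) -> float:
    """P(two-sided p < alpha) for a z test with noncentrality delta."""
    z_crit = STD.inv_cdf(1 - alpha / 2)
    return STD.cdf(delta - z_crit) + STD.cdf(-delta - z_crit)


def prob_p_between(lo: float, hi: float, delta: float) -> float:
    """P(lo < p < hi): power at the upper threshold minus power at the lower."""
    return power(hi, delta) - power(lo, delta)


DELTAS = [0.0, 1.0, 2.0, 2.8, 4.0]  # ASSUMED true effects, in standard-error units
for delta in DELTAS:
    band_03 = prob_p_between(0.03, 0.05, delta)
    band_02 = prob_p_between(0.02, 0.05, delta)
    print(f"delta={delta}: P(.03<p<.05)={band_03:.3f}, P(.02<p<.05)={band_02:.3f}")
```

Under the null the probability equals the width of the band (0.02 for .03 to .05), since p is then uniformly distributed.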

RS’s arguments fail

  • “Discovery” is more than just a significant p value
    • A new criterion is unlikely to help the situation
  • No general calibration of p values to Bayes factors
    • Sometimes p will “appear” to be more lenient…
    • …sometimes BF will “appear” to be more lenient
  • Unlikely that marginal p values have helped cause the crisis
    • Best-case Bayesian argument: weakly evidential p’s are rare

Thank you.