“Null hypothesis significance testing”
- Bastardization of classical significance testing logic
- Focus on single hypothesis
- Focus on single “effect”
- Focus on single criterion for “existence”
- Statistical rejection mistaken for theoretical support
Newly popular: Replication attempts
- Reproducibility Project: Psychology
- “Many Labs” projects
- Large-scale replications by skeptics
The hammer strikes again:
- Single effects
- Primary outcome: “Is replication ‘significant’?”
- Many replications “fail” (!)
“We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.”
- p ≈ .05 represents “weak [Bayesian] evidence”
- p < .05 inflates “false discovery rate”
- p < .05 results less likely to “replicate” than p < .005
➡ Using p < .05 “leading cause of non-reproducibility”
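A minimal sketch of the kind of calibration behind the “weak evidence” claim. It uses the Sellke–Bayarri–Berger upper bound BF ≤ 1/(−e·p·ln p), which is one common p-to-Bayes-factor calibration assumed here for illustration, not necessarily the one used in the proposal or the talk.

```python
import math

def bf_upper_bound(p):
    """Upper bound on the Bayes factor against H0 for a given p value,
    BF <= 1 / (-e * p * ln p); valid for p < 1/e
    (Sellke, Bayarri & Berger, 2001)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound applies only for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.01, 0.005):
    print(f"p = {p:.3f}  ->  BF against H0 at most ~{bf_upper_bound(p):.1f}")

# p = .05  -> at most ~2.5  : "weak" evidence even in the best case
# p = .005 -> at most ~13.9 : roughly the evidence level the proposal targets
```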
The RS proposal…
- …encourages a contextless, simplistic view of “discovery”
- …is based on restrictive, unrealistic prior assumptions
- …exaggerates the role of necessarily rare p values near .05
Among others:
- Argument 2: “False discovery rate” is flawed (Mayo & Morey, in prep.)
- Argument 3: holds for any α₂ < α₁, so it cannot single out .005 in particular
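A small sketch of the false-discovery-rate arithmetic these arguments target. The base rate of true effects (`prior_true`) and the power are illustrative assumptions, not values from the talk or the proposal; the point of Argument 3 shows up in the output, since the FDR falls for any reduction of α, not just for .005.

```python
def false_discovery_rate(alpha, power, prior_true=0.1):
    """FDR among 'significant' results, assuming a fraction `prior_true`
    of tested hypotheses are real effects detected with the given power."""
    false_pos = (1 - prior_true) * alpha   # nulls that cross the threshold
    true_pos = prior_true * power          # real effects that cross it
    return false_pos / (false_pos + true_pos)

for alpha in (0.05, 0.02, 0.005):
    print(f"alpha = {alpha:.3f}  ->  FDR ~ "
          f"{false_discovery_rate(alpha, power=0.8):.2f}")

# The FDR shrinks for *any* alpha_2 < alpha_1, so this calculation alone
# cannot justify .005 as the right threshold.
```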
Idea: choose half the prior for a one-sided test
- Bayesian can choose side consistent with data
- Evidential BF “boost”: approx. 2 − p (for smallish p)
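A numerical sketch of the one-sided “boost”, under an assumed setup: a z-test with a N(0, prior_sd²) prior on the standardized effect under H1, where the prior scale is a hypothetical choice. Folding the prior onto the side the data favour roughly doubles the Bayes factor for smallish p, in line with the 2 − p approximation above.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bayes_factors(z, prior_sd=1.0):
    """Two-sided vs. one-sided Bayes factors (H1 over H0) for an observed
    z-statistic, assuming z > 0 and a N(0, prior_sd^2) prior on the
    standardized effect under H1; the one-sided version folds the prior
    onto the positive side (the side the data favour)."""
    like_h0 = stats.norm.pdf(z)  # likelihood under H0: theta = 0

    def two_sided_prior(theta):
        return stats.norm.pdf(theta, 0, prior_sd)

    def one_sided_prior(theta):
        return 2 * stats.norm.pdf(theta, 0, prior_sd)  # half-normal, theta > 0

    marg_two, _ = quad(lambda t: stats.norm.pdf(z, t, 1) * two_sided_prior(t),
                       -np.inf, np.inf)
    marg_one, _ = quad(lambda t: stats.norm.pdf(z, t, 1) * one_sided_prior(t),
                       0, np.inf)
    return marg_two / like_h0, marg_one / like_h0

for z in (1.96, 2.58):  # two-sided p of about .05 and .01
    bf2, bf1 = bayes_factors(z)
    print(f"z = {z:.2f}: two-sided BF ~ {bf2:.2f}, one-sided BF ~ {bf1:.2f}, "
          f"boost ~ {bf1 / bf2:.2f}")

# The boost approaches 2*Phi(z) = 2 - p (two-sided p) as the prior spreads out.
```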
Point null may be appropriate for some problems
- No reason for Bayesians to use two-sided hypotheses!
- So: p < .03 or p < .02 more appropriate calibration
- But… are these p values that common without cheating?
“Discovery” is more than just a significant p value
- A new criterion is unlikely to help the situation
- No general calibration of p values to Bayes factors
- Sometimes p will “appear” to be more lenient…
- …sometimes BF will “appear” to be more lenient
- Unlikely that marginal p values have helped cause the crisis
- Best-case Bayesian argument: weakly evidential p’s are rare
Thank you.