1
The Statistics Wars and Their
Casualties
Deborah G Mayo
Dept of Philosophy, Virginia Tech
Research Associate, LSE
September 22, 2022
Workshop: The Statistics Wars
and Their Casualties
LSE (CPNSS)
2
Begin with a question: From
what perspective should we
view the statistics wars?
• The standpoint of the ordinary, skeptical
consumer of statistics
• Minimal requirement for evidence
3
Requirement of the skeptical
statistical consumer
• We have evidence for a claim C only to the
extent C has been subjected to and passes a
test that would probably have found it flawed
or specifiably false, just if it is.
• This probability is the stringency or severity
with which it has passed the test.
4
Applies to any methods now in use
• Whether for testing, estimation, prediction—or
solving a problem (formal or informal)
5
2. Statistical Significance test battles
& their ironies
• Often fingered as the culprit of the replication crisis
• It’s too easy to get small P-values—critics say
• Replication crisis: It’s too hard to get small P-values
when others try to replicate with stricter controls
6
• R.A. Fisher: it’s easy to lie with statistics by
selective reporting (the “political principle that anything
can be proved by statistics” (1955, 75))
• Sufficient finagling—cherry-picking, data-dredging,
multiple testing, optional stopping—may result in a
claim C appearing supported, even if it’s
unwarranted—biasing selection effects
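A minimal simulation sketch of this point (assumed setup for illustration: 20 independent one-sided z-tests of true nulls, α = .05, only the smallest P-value reported):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_tests, n_obs, alpha = 10_000, 20, 25, 0.05

# Each trial runs 20 one-sided z-tests on samples drawn under a true null
z = rng.normal(size=(n_trials, n_tests, n_obs)).mean(axis=2) * np.sqrt(n_obs)
p = stats.norm.sf(z)                       # one-sided P-values, uniform under H0

# Cherry-picking: report only the smallest of the 20 P-values
print("Pr(at least one P < .05):", (p.min(axis=1) < alpha).mean())
# ~0.64, not 0.05: the reported "finding" is an artifact of selection
```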
7
Error Statistics
This underwrites the key aim of statistical significance
tests:
• To bound the probabilities of erroneous
interpretations of data: error probabilities
A small part of a general methodology which I call
error statistics
(statistical tests, confidence intervals, resampling,
randomization)
8
Fraud-busting and non-replication
based on P-values
• Remember when fraud-busters (Uri Simonsohn
and colleagues) used statistical significance tests
to expose fraud?
• data too good to be true, or
• inexplicable under sampling variation
(Smeesters, Sanna)
“Fabricated Data Detected by Statistics Alone” Simonsohn 2013
9
• How is it that tools relied on to show fraud, QRPs,
lack of replication are said to be tools we can’t
trust?
(“P-values can’t be trusted unless they are used to
show P-values can’t be trusted”)
• Simmons et al. recommend “a 21 word solution”:
state stopping rules, hypotheses, etc. in
advance (Simmons et al., 2012)
• Yet some reforms are at odds with this
10
3. Simple significance (Fisherian)
tests
“…to test the conformity of the particular data
under analysis with H0 in some respect….” (Mayo
and Cox 2006, p. 81)
…the P-value: the probability the test would yield
an even larger (or more extreme) value of the test
statistic T, assuming mere chance variability or noise
NOT Pr(data | H0)
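A minimal sketch in code (assuming, for illustration, a one-sided z-test of H0: μ = 0 with known σ = 1 and hypothetical data):

```python
import numpy as np
from scipy import stats

x = np.array([0.6, 1.2, -0.1, 0.8, 1.5, 0.3, 0.9, 0.4])  # hypothetical data
sigma, mu0 = 1.0, 0.0
t_obs = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))       # observed value of T

# P-value: Pr(T >= t_obs; H0), computed from T's sampling distribution under H0
# (written with ";" not "|": H0 is not conditioned on as a random event)
p_value = stats.norm.sf(t_obs)
print(f"t_obs = {t_obs:.2f}, P = {p_value:.4f}")
```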
11
Testing reasoning
• Small P-values indicate* some underlying
discrepancy from H0, because very probably (1 − P)
you would have seen a less impressive
difference were H0 true.
• This still isn’t evidence of a genuine statistical
effect H1, let alone a scientific conclusion H*—only
abuses of tests (NHST?) commit these howlers
*(until an audit is conducted testing assumptions, I
use “indicate”)
12
Neyman and Pearson tests (1933) put
Fisherian tests on firmer ground:
They introduce alternative hypotheses H0, H1:
H0: μ ≤ 0 vs. H1: μ > 0
• Trade-off between Type I errors and Type II errors
• Restricts the inference to the statistical alternative—no
jumps to H* (within a model)
Tests of Statistical Hypotheses, statistical decision-making
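A sketch of the trade-off (assumed for illustration: one-sided z-test, σ = 1, n = 25, alternative μ1 = 0.4): lowering the Type I error rate α raises the Type II error rate β:

```python
import numpy as np
from scipy import stats

sigma, n, mu1 = 1.0, 25, 0.4          # mu1: an illustrative alternative
se = sigma / np.sqrt(n)

for alpha in (0.10, 0.05, 0.01):
    cutoff = stats.norm.ppf(1 - alpha) * se     # reject H0: mu <= 0 when xbar > cutoff
    beta = stats.norm.cdf((cutoff - mu1) / se)  # Type II error at mu = mu1
    print(f"alpha = {alpha:.2f}: beta(mu1) = {beta:.3f}, power = {1 - beta:.3f}")
```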
13
Fisher-Neyman
(pathological) battles
• The success of N-P optimal error control led to a
new paradigm in statistics, overshadowing Fisher
• “being in the same building at University College
London brought them too close to one another”!
(Cox 2006, 195)
14
Contemporary casualties of Fisher-
Neyman (N-P) battles
• N-P & Fisher tests claimed to be an “inconsistent
hybrid” where:
• Fisherians can’t use power; N-P testers can’t report
P-values (P =) but only fixed error probabilities (P <)
• In fact, Fisher & N-P recommended both pre-data
error probabilities and post-data P-values
15
What really happened concerns
Fisher’s “fiducial probability”
The fallacy in familiar terms: Fisher claimed the
confidence level measures both error control and
post-data probability on statistical hypotheses
without prior probabilities—in special cases
But it leads to inconsistencies
16
“[S]o many people assumed for so long
that the [fiducial] argument was correct.
They lacked the daring to question it.”
(Good 1971, p. 138).
• Neyman did, developing confidence
intervals (performance rationale)
17
Do we need to know the history to get
beyond the statistics wars?
No, we shouldn’t be hamstrung by battles from 70,
80 or 90 years ago, or by what some of today’s
discussants think they were about.
“It’s the methods, stupid” (Mayo 2018, 164)
18
• Key question remains (from the fiducial battle): how
to have a post-data quantification of epistemic
warrant (but not a posterior probability)?
• Severity? Calibration?
19
Sir David Cox’s statistical
philosophy
• We need to calibrate methods: how would they behave
in (actual or hypothetical) repeated sampling?
(performance)
o Weak repeated sampling: “any proposed method of
analysis that in repeated application would mostly
give misleading answers is fatally flawed”
(Cox 2006, 198)
20
Good performance not sufficient for
an inference measure (post data)
Cox’s “weighing machine” example in 1958
How can we ensure the calibration is relevant
(taking account of how the data were obtained)
without leading to the unique case, precluding error
probabilities? (Cox 2006, 198)
“Objectivity and Conditionality” (Cox and Mayo 2010)
21
4. Rivals to error statistical accounts
condition on the unique case
All the evidence is via likelihood ratios (LR) of
hypotheses
Pr(x0;H1)/Pr(x0;H0)
The data x0 are fixed, while the hypotheses vary
• Any hypothesis that perfectly fits the data is
maximally likely
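A minimal sketch of that last point (assuming a normal model with unit variance and hypothetical data):

```python
import numpy as np
from scipy import stats

x0 = np.array([1.3, 0.2, 0.9, 1.7, 0.5])   # hypothetical fixed data

def lik(mu):
    """Likelihood of the data under N(mu, 1)."""
    return np.prod(stats.norm.pdf(x0, loc=mu, scale=1.0))

mu_hat = x0.mean()   # a hypothesis constructed to fit the data perfectly
print("LR, tailored hypothesis vs. mu = 0:", lik(mu_hat) / lik(0.0))
# The tailored hypothesis is maximally likely regardless of how it was reached
```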
22
All error probabilities violate the LP
• “Sampling distributions, significance levels,
power, all depend on something more [than the
likelihood function]–something that is irrelevant
in Bayesian inference–namely the sample
space.” (Lindley 1971, 436)
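The classic textbook illustration, sketched in code: 9 successes in 12 Bernoulli trials yields proportional likelihoods whether n = 12 was fixed in advance (binomial) or sampling continued until the 3rd failure (negative binomial), yet the P-values for H0: θ = 0.5 differ because the sample spaces differ:

```python
from scipy import stats

# Same data: 9 successes, 3 failures; test H0: theta = 0.5 vs. theta > 0.5
p_binomial = stats.binom.sf(8, 12, 0.5)   # n = 12 fixed in advance: Pr(X >= 9; H0)
p_negbinom = stats.nbinom.sf(8, 3, 0.5)   # sample until 3rd failure: Pr(X >= 9; H0)
print(f"binomial P = {p_binomial:.4f}, negative binomial P = {p_negbinom:.4f}")
# ~0.073 vs ~0.033: identical likelihood function, different sample spaces
```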
23
[Diagram: inference by Bayes’ Theorem obeys the Likelihood Principle]
24
[Diagram: inference by Bayes’ Theorem obeys the Likelihood Principle, and so forfeits error probabilities]
25
Many “reforms” offered as alternatives to
significance tests follow the LP
• “Bayes factors can be used in the complete
absence of a sampling plan…” (Bayarri et al. 2016,
100)
• “It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given….Data should be able to speak for
itself”. (Berger and Wolpert 1988, 78)
(Stopping Rule Principle)
26
In testing the mean of a standard normal distribution
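A simulation sketch of that setup (capping each trial at 1,000 observations for illustration): with try-and-try-again optional stopping at the nominal 2-standard-error cutoff, the actual Type I error rate is far above .05:

```python
import numpy as np

rng = np.random.default_rng(7)
n_trials, n_max = 2_000, 1_000
rejections = 0

for _ in range(n_trials):
    x = rng.normal(size=n_max)          # H0 true: standard normal observations
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(n)       # z-statistic recomputed after each observation
    if np.any(np.abs(z) >= 1.96):       # stop the first time the nominal .05 cutoff is hit
        rejections += 1

print("Actual Type I error rate:", rejections / n_trials)
# Well above the nominal 0.05, and it approaches 1 as n_max grows
```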
27
The LP parallels the holy grail of
logics of induction C(h,e)
I was brought up on C(h,e), but it doesn’t work.
Popper (a falsificationist): “we shall simply deceive
ourselves if we think we can interpret C(h,e) as degree
of corroboration, or anything like it.” (Popper 1959,
418).
He never fleshed out severity
28
Fisher, Neyman, Pearson were
allergic to the idea of a single rule
for ideally rational inference
• Their philosophy of statistics was pragmatic: to
control human biases.
(design, planning, RCTs, predesignated power)
• But there’s a link to formal statistics: the biases
directly alter the method’s error probabilities
• Not automatic, requires background knowledge
29
5. Bayesians: we can block inferences
based on biasing selection effects
with prior beliefs
(without error probabilities)
30
Casualties
• Doesn’t show what researchers had done wrong—
battle of beliefs
• The believability of data-dredged hypotheses is what
makes them so seductive
• An additional source of flexibility: priors as well as
biasing selection effects
31
Peace Treaty (J. Berger 2003, 2006):
“default” (“objective”) priors
• Elicitation problems: “[V]irtually never would
different experts give prior distributions that even
overlapped” (J. Berger 2006, 392)
• Default priors are to prevent prior beliefs from
influencing the posteriors: data dominant
32
Casualties
• “The priors are not to be considered expressions of
uncertainty, …may not even be probabilities…” (Cox and
Mayo 2010, 299)
• No agreement on rival systems* for default/non-subjective
priors
• The reconciliation leads to violations of the LP, forfeiting
Bayesian coherence while not fully error statistical
(casualty for Bayesians?)
*Invariance, maximum entropy, frequentist matching
33
6. A key battle in the statistics wars
(old and new):
P-values vs posteriors
• P-value can be small, but Pr(H0|x) not small, or
even large.
• To a Bayesian this shows P-values exaggerate the
evidence against.
34
• “[T]he reason that Bayesians can regard P-
values as overstating the evidence against the
null is simply a reflection of the fact that
Bayesians can disagree sharply with each other”
(Stephen Senn 2002, 2442)
35
Some regard this as a Bayesian family feud
(“spike and smear”)
• Whether to test a point null hypothesis, a lump of
prior probability on H0
Xi ~ N(μ, σ2)
H0: μ = 0 vs. H1: μ ≠ 0.
• Depending on how you spike and how you smear,
a result just significant at level α can even
correspond to Pr(H0|x) = (1 – α)! (e.g., 0.95)
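A sketch of one such computation (assumed choices for illustration: spike Pr(H0) = 0.5, smear μ ~ N(0, τ2) with τ = 1 under H1, known σ = 1):

```python
import numpy as np
from scipy import stats

sigma, tau, z = 1.0, 1.0, 1.96             # tau: sd of the "smear" prior on mu under H1
for n in (100, 10_000, 1_000_000):
    xbar = z * sigma / np.sqrt(n)           # result just significant at the .05 level
    m0 = stats.norm.pdf(xbar, 0, sigma / np.sqrt(n))               # marginal under H0
    m1 = stats.norm.pdf(xbar, 0, np.sqrt(tau**2 + sigma**2 / n))   # marginal under H1
    post_H0 = 0.5 * m0 / (0.5 * m0 + 0.5 * m1)                     # spike Pr(H0) = 0.5
    print(f"n = {n:>9}: Pr(H0|x) = {post_H0:.3f}")
# P ~ .05 in every case, yet Pr(H0|x) climbs toward 1 as n grows
```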
36
• A deeper casualty is assuming there ought to be
agreement between quantities measuring
different things
37
7. Battles between officials,
agencies, journal editors—and
their (unintended) consequences
38
ASA (President’s) Task Force on
Statistical Significance and
Replicability (2019-2021)
The Task Force’s one-page statement says:
“P-values and significance testing, properly applied
and interpreted, are important tools that should not be
abandoned.”
“Much of the controversy surrounding statistical
significance can be dispelled through a better
appreciation of uncertainty, variability, multiplicity, and
replicability”. (Benjamini et al. 2021)
39
The ASA President’s Task Force:
Linda Young, National Agric Stats, U of Florida (Co-Chair)
Xuming He, University of Michigan (Co-Chair)
Yoav Benjamini, Tel Aviv University
Dick De Veaux, Williams College (ASA Vice President)
Bradley Efron, Stanford University
Scott Evans, George Washington U (ASA Pubs Rep)
Mark Glickman, Harvard University (ASA Section Rep)
Barry Graubard, National Cancer Institute
Xiao-Li Meng, Harvard University
Vijay Nair, Wells Fargo and University of Michigan
Nancy Reid, University of Toronto
Stephen Stigler, The University of Chicago
Stephen Vardeman, Iowa State University
Chris Wikle, University of Missouri
40
The task force was created to stem
casualties of an ASA Director’s
editorial (2019)*
• “declarations of ‘statistical significance’ be
abandoned” (Wasserstein, Schirm & Lazar 2019)
• You may use P-values, but don’t assess them by
preset thresholds (e.g., .05, .01, .005): the no
significance/no threshold view
*2022 disclaimer
41
Some (unintended) casualties
• The appearance that statistics is withdrawing tools for a
major task for which scientists look to statistics: to
distinguish genuine effects from noise.
• And even the appearance that this is ASA policy, which it’s not
42
Most serious casualty: Researchers lost little time:
“Given the recent discussions to abandon
significance testing it may be useful to move away
from controlling type I error entirely in trial
designs.” (Ryan et al. 2020, radiation oncology)
Useful for whom?
43
Not for our skeptical consumer of
statistics
• To evaluate a researcher’s claim of benefits of a
radiation treatment, she wants to know: how many
chances did they give themselves to find benefit
even if spurious (data dredging, optional stopping)?
• Not enough that their informative prior favors the
intervention—“trust us, we’re Bayesians”
44
No tests, no falsification
• If you cannot say about any results, ahead of time,
that they will not be allowed to count in favor of a
claim C — if you deny any threshold — then you do
not have a test of C
• Most would balk at methods with error probabilities
over 50% — violating Cox’s weak repeated
sampling principle
• N-P had an undecidable region
45
Some say: We do not worry about
Type I error control: All null
hypotheses are false?
1. The claim “We know all nulls are false” boils down
to: all models are strictly idealizations—but it does
not follow that all effects are real
2. Not just Type I errors go: all error probabilities
(Type II, magnitude, sign) depend on the sampling
distribution
46
Reformulate tests
• I’ve long argued against misuses of significance
tests
I introduce a reformulation of tests in terms of
discrepancies (effect sizes) that are and are not
severely tested: SEV(Test T, data x, claim C)
• In a nutshell: one tests several discrepancies
from a test hypothesis and infers those that are well
or poorly warranted
Mayo 1991-2018; Mayo and Spanos (2006); Mayo and Cox (2006);
Mayo and Hand (2022)
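A minimal sketch of the computation (assuming the one-sided test T+ of H0: μ ≤ 0 vs. H1: μ > 0 with σ = 1, n = 100, illustrative numbers), where SEV(μ > μ1) = Pr(X̄ ≤ x̄; μ = μ1) for a statistically significant result:

```python
import numpy as np
from scipy import stats

sigma, n = 1.0, 100
se = sigma / np.sqrt(n)
xbar_obs = 0.2                # observed mean, 2 SE above 0: just significant

def sev_greater(mu1):
    """Severity for the claim mu > mu1, given the significant result."""
    return stats.norm.cdf((xbar_obs - mu1) / se)

for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1}) = {sev_greater(mu1):.3f}")
# mu > 0 is severely indicated (~.977); mu > 0.3 is not (~.159)
```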
47
Avoid misinterpreting a 2SE
significant result
48
What about fallacies of
non-significant results?
• Not evidence of no discrepancy, but not
uninformative, even for simple significance tests
• Minimally: the test wasn’t capable of distinguishing
the effect from sampling variability
• May also be able to find upper bounds μ1 for the
discrepancy
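A companion sketch for the non-significant case (same illustrative test, observed mean 1 standard error above 0): SEV(μ ≤ μ1) = Pr(X̄ > x̄; μ = μ1) picks out which upper bounds are and are not warranted:

```python
import numpy as np
from scipy import stats

sigma, n = 1.0, 100
se = sigma / np.sqrt(n)
xbar_obs = 0.1                # non-significant result, 1 SE above 0

def sev_upper(mu1):
    """Severity for the claim mu <= mu1, given the non-significant result."""
    return stats.norm.sf((xbar_obs - mu1) / se)   # Pr(Xbar > xbar_obs; mu = mu1)

for mu1 in (0.1, 0.2, 0.3):
    print(f"SEV(mu <= {mu1}) = {sev_upper(mu1):.3f}")
# mu <= 0.3 passes with severity ~.977; mu <= 0.1 only ~.5
```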
49
Setting upper bounds
50
Severity vs Power
51
Why do some accounts say a result just significant
at level α is stronger evidence of (μ > μ1) as
POW(μ1) increases?
One explanation is the following comparative analysis:
Let x = Test T rejects H0 just at α = .02
Pr(x; μ1)/Pr(x; μ0) = POW(μ1)/α
POW(μ1) = Pr(Test T rejects H0; μ1)—the numerator.
As μ1 increases, POW(μ1) in the numerator increases, so
the more evidence for (μ > μ1)—but this is wrong!
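Numbers make the fallacy vivid (same illustrative test T+, σ = 1, n = 100, α = .02): the comparative ratio POW(μ1)/α grows with μ1, while the severity for μ > μ1 falls:

```python
import numpy as np
from scipy import stats

sigma, n, alpha = 1.0, 100, 0.02
se = sigma / np.sqrt(n)
xbar = stats.norm.ppf(1 - alpha) * se       # xbar just reaching the alpha = .02 cutoff

for mu1 in (0.1, 0.2, 0.3, 0.4):
    power = stats.norm.sf((xbar - mu1) / se)   # POW(mu1) = Pr(Test T rejects H0; mu1)
    ratio = power / alpha                      # the comparative-analysis "evidence" ratio
    sev = stats.norm.cdf((xbar - mu1) / se)    # SEV(mu > mu1)
    print(f"mu1 = {mu1}: POW/alpha = {ratio:5.1f}, SEV(mu > mu1) = {sev:.3f}")
# The ratio grows with mu1 while severity shrinks: high power against mu1
# makes a just-significant result poor evidence that mu > mu1
```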
52
Recap
53
The skeptical consumer of statistics: show me what
you’ve done to rule out ways you can be wrong.
• Biasing selection effects alter a method’s error
probing capacities
These endanger all methods, but many methods lack
the antenna to pick up on them.
54
Fisherian and N-P tests can block threats to
error control, but pathological battles result in
their being viewed as an inconsistent hybrid
• Where Fisherians can’t use power, and N-P testers
can’t report attained P-values—forfeiting features
they each need
55
Can keep the best from both Fisher and N-P: Use
error probabilities inferentially
• What alters error probabilities
• alters error probing capabilities
• alters well testedness
56
Rivals to error statistical accounts condition on the
data: the import of the data is through likelihood ratios (LP)
(e.g., Bayes factors, likelihood ratios)
So error probabilities drop out
• To the LP holder, considering what could have
happened but didn’t is to consider “imaginary data”
• To the severe tester, probabilists are robbed
of a main way to block spurious results
The error statistician and LP holders talk past each
other
57
Bayesians may block inferences based on
biasing selection effects without appealing to
error probabilities:
• high prior belief probabilities on H0 (no effect)
can result in a high posterior probability on H0
Casualties:
• Puts the blame in the wrong place
• Unclear how to obtain and interpret the priors
• Increased flexibility
58
Recent feuds among statistical thought-leaders
lead some to recommend “abandoning”
significance & P-value thresholds
Casualties:
• A bad argument: don’t use a method because it
may be used badly.
• No thresholds, no tests, no falsification
• Harder to hold researchers accountable for
biasing selection effects
• No tests of assumptions
Mayo and Hand (2022)
59
• We reformulate tests to report the extent of
discrepancies that are and are not indicated
with severity
• Avoids fallacies
• Reveals casualties of equating concepts
from schools with different aims
• If time:
confidence intervals are also improved;
with CIs it’s “the CI only” movement that’s the
casualty
60
61
In appraising statistical reforms ask:
• What’s their notion of probability?*
• What’s their account of statistical evidence
(LP?)
*If the parameter has a genuine frequentist
distribution, frequentists can use it too—
deductive updating
62
• A silver lining to distinguishing rival concepts: we can
use different methods for different contexts
• Some Bayesians may find their foundations for
science in error statistics
• Stop refighting the stat wars (by 2034?)
63
• Attempts to “reconcile” tools with different aims
lead to increased conceptual confusion.
64
In the context of the skeptical
consumer of statistics, methods
should be:
• directly altered by biasing selection effects
• able to falsify claims statistically
• able to test statistical model assumptions
• able to block inferences that violate minimal
severity
65
For those contexts: we shouldn’t
throw out the error control baby with
the bad statistics bathwater
66
67
References
• Amrhein, V., Greenland, S. and McShane B. (2019). “Comment: Retire Statistical Significance”, Nature
567: 305-7. (Online, 20 March 2019).
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal
for Statistical Practice in Testing Hypotheses." Journal of Mathematical Psychology 72: 90-103.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2017). “Redefine Statistical Significance”, Nature Human
Behaviour 2, 6–10
• Benjamini, Y., De Veaux, R. D., Efron, B., Evans, S., Glickman, M., Graubard, B. I., He, X., Meng, X.-L.,
Reid, N., & Stigler, S. M. (2021). The asa president's task force statement on statistical significance and
replicability. Annals of Applied Statistics, 15(3), 1084–1085.
• Berger, J. O. (2003). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing?’ and Rejoinder’,
Statistical Science 18(1), 1–12; 28–32.
• Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph
Series. Hayward, CA: Institute of Mathematical Statistics.
• Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing
Problem”, Journal of the American Statistical Association 82(397), 106-11.
• Colquhoun, D. (2014). ‘An Investigation of the False Discovery Rate and the Misinterpretation of P-values’,
Royal Society Open Science 1(3), 140216.
• Cox, D. R. (1958). “Some Problems Connected with Statistical Inference”, Annals of Mathematical Statistics
29(2), 357-72.
• Cox, D. R. (2006). Principles of Statistical Inference. Cambridge: Cambridge University Press.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” Error and
Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of
Science. Mayo and Spanos (eds.), 276–304. CUP.
• Fisher, R. A. (1930). “Inverse Probability”. Mathematical Proceedings of the Cambridge Philosophical
Society 26(4), 528-35.
68
• Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.
• Fisher, R. A. (1955). “Statistical Methods and Scientific Induction”, Journal of the Royal Statistical Society:
Series B 17(1): 69-78.
• Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd. Reprinted in
R. A. Fisher (1990).
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and Sprott, D. (eds.),
Foundations of Statistical Inference 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual
Foundation. Chicago: University of Chicago Press.
• Mayo, D. (2016). ‘Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary
on Wasserstein, R. L. and Lazar, N. A. 2016, “The ASA’s Statement on p-Values: Context, Process, and
Purpose”’, The American Statistician 70(2) (supplemental materials).
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars,
Cambridge: Cambridge University Press.
• Mayo, D. G. (2022). The statistics wars and intellectual conflicts of interest. Conservation Biology, 36(1).
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in Rojo, J.
(ed.) The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series,
Volume 49, Institute of Mathematical Statistics: 247-275.
• Mayo, D. G. and Hand, D. (2022). “Statistical significance and its critics: practicing damaging science, or
damaging scientific practice?” Synthese 200, 220. https://doi.org/10.1007/s11229-022-03692-0
• Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy
of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.
69
• Mayo, D. G., and A. Spanos (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S.
Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The
Netherlands: Elsevier.
• Neyman, J. & Pearson, E. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses”,
Philosophical Transactions of the Royal Society of London Series A 231: 289-337. Reprinted in Joint
Statistical Papers, 1-66.
• Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science”, Science
349(6251), 943–51.
• Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint Statistical Papers by J.
Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press).
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton FL: Chapman and Hall, CRC
press.
• Ryan, E. G., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust for interim analyses in a
Bayesian adaptive trial design? BMC Medical Research Methodology, 20(1).
https://doi.org/10.1186/s12874-020-01042-7
• Selvin, H. (1970). “A critique of tests of significance in survey research”. In The significance test controversy,
edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
• Senn, S. (2002). “A comment on replication, p-values and evidence, S. N. Goodman, Statistics in Medicine
1992; 11:875-879”. Statistics in Medicine 21(16), 2437–44.
• Simmons, J. Nelson, L. and Simonsohn, U. (2012) “A 21 word solution”, Dialogue: 26(2), 4–7.
• Simonsohn, U. (2013). Just post it: the lesson from two cases of fabricated data detected by statistics
alone. Psychological Science, 24(10), 1875–1888.
• Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”,
The American Statistician 70(2), 129–33.
• Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The
American Statistician 73(S1): 1-19.
