Revisiting the Two Cultures in Statistical Modeling and Inference:
the Statistics Wars and Their Potential Casualties
Aris Spanos [Virginia Tech, USA]
1. Introduction
Paradigm shifts in statistics during the 20th century
2. Karl Pearson’s descriptive statistics (1894-1920s)
The original curve-fitting
3. Fisher’s model-based statistical induction (1922)
Securing statistical adequacy and the trustworthiness of evidence
4. *Graphical Causal modeling (1990s)
Curve-fitting substantive models
5. *The nonparametric turn for model-based statistics (1970s)
Replacing ‘distribution’ assumptions with non-testable assumptions
6. Data Science (Machine Learning and all that!) (1990s)
Curve-fitting using algorithmic searches
7. Summary and Conclusions
Potential casualties of the statistics wars
1 Introduction
Breiman (2001): “There are two cultures in the use of statistical modeling
to reach conclusions from data. One assumes that the data are generated by
a given stochastic data model. The other uses algorithmic models and treats the
data mechanism as unknown.”
During the 20th century statistical modeling and inference experienced
several paradigm shifts, the most notable being:
Karl Pearson’s descriptive statistics (— 1920s), Fisher’s model-based
statistics (1920s), Nonparametric statistics (1970s), Graphical Causal
modeling (1990s), and Data Science (Machine Learning, Statistical Learning
Theory, etc.) (1990s).
Key points argued in the discussion that follows
• The discussions on non-replication overlook a key contributor to untrustworthy evidence, statistical misspecification: invalid probabilistic assumptions imposed (explicitly or implicitly) on one’s data x0:=(x1, x2, ..., xn).
• There is a direct connection between Karl Pearson’s descriptive statistics,
Nonparametric statistics and the Data Science curve-fitting.
• All three approaches rely on (i) curve-fitting, (ii) goodness-of-fit/prediction measures, and (iii) asymptotic inference results (as n → ∞) based on non-testable probabilistic/mathematical assumptions.
• The Curve-Fitting Curse: when empirical modeling relies on curve-fitting of mathematical functions with a sufficiently large number of parameters to fine-tune (e.g. neural networks, orthogonal polynomials), one will always find a ‘best’ model on goodness-of-fit/prediction grounds, even if that model is totally false. Worse, one will be oblivious to the fact that such a ‘best’ model will commonly yield untrustworthy evidence!
• ‘Best’ goodness-of-fit/prediction, i.e. ‘small’ residuals/prediction errors relative to a particular loss function, is neither necessary nor sufficient for trustworthy evidence! What ensures the latter is the statistical adequacy (approximate validity) of the invoked statistical model Mθ(x), comprising the probabilistic assumptions imposed on one’s data; Spanos (2007).
• Trustworthy evidence stems from procedures whose actual error probabilities approximate ‘closely’ the nominal ones, derived by presuming the validity of Mθ(x). That is, the trustworthiness of evidence originates in the relevant error probabilities as they relate to the severity principle.
All approaches to statistics require three basic elements:
(i) substantive questions of interest—however vague or highly specific,
(ii) appropriate data x0 to shed light on these questions (learn from x0),
(iii) probabilistic assumptions imposed (implicitly or explicitly) on the observable process {X_t, t∈N} underlying data x0. These are the assumptions that matter for statistical inference purposes, and NOT those of any error terms.
Key differences of alternative approaches to statistics
[a] Inductive premises: their framing of the inductive premises (probabilistic
assumptions imposed on the data), and the interpretation of the selected model.
[b] Model choice: the selection of the ‘best’ model for the particular data.
[c] Inductive inference: the underlying inductive reasoning and the nature
and interpretation of their inferential claims.
[d] Substantive vs. statistical information/model: how they conciliate the
substantive (theory-based) and statistical (data-based) information.
2 Karl Pearson’s descriptive statistics
Data x0:=(x1, x2, ..., xn)  =⇒  Histogram of the data  =⇒  Fitting f(x; θ̂1, θ̂2, θ̂3, θ̂4)
Diagram 1: Karl Pearson’s approach to statistics
One begins with the raw data x0:=(x1, x2, ..., xn), whose initial ‘rough summary’ takes the form of a histogram with m ≥ 10 bins. To provide a more succinct descriptive summary of the histogram, Pearson would use the first four raw moments of x0 to select a frequency curve within a particular family known today as the Pearson family (F). Members of this family are generated by:
F: d ln f(x; ψ)/dx = (x − θ1)/(θ2 + θ3·x + θ4·x²), θ∈Θ⊂R⁴, x∈R:=(−∞, ∞)   (2.0.1)
that includes several well-known distributions. F is characterized by the four unknown parameters θ:=(θ1, θ2, θ3, θ4), estimated using the raw sample moments μ̂_r(x0) = (1/n)∑_{t=1}^n x_t^r, r=1, 2, 3, 4, yielding θ̂(x0):=(θ̂1, θ̂2, θ̂3, θ̂4). θ̂(x0) is used to select f0(x)∈F based on the estimated curve f(x; θ̂(x0)) that ‘best’ fits the histogram, using Pearson’s (1900) goodness-of-fit test:
q(X) = ∑_{i=1}^m [(n_i − n·p̂_i)² / (n·p̂_i)] ∼ χ²(m−1), as n → ∞   (2.0.2)
What Pearson and his contemporaries did not appreciate sufficiently is that, irrespective of whether one is summarizing the data for descriptive or inferential purposes, one implicitly imposes probabilistic assumptions on the data.
For instance, the move from the raw data x0 to a histogram invokes a ‘random’ (IID) sample X:=(X1, X2, ..., Xn) underlying data x0, and so do the formulae:
=1

P
=1  b
2
=1

P
=1(−)2
 =1

P
=1  b
2
=1

P
=1(−)2

b
=
h
(
P
=1(−)(−)) 
p
[
P
=1(−)2] [
P
=1(−)2]
i

when estimating ()  () ( ) etc.; see Yule (1926).
I Charging Karl Pearson with ignorance would be anachronistic, since the theory of stochastic processes needed to understand the concept of non-IID samples was framed in the late 1920s and early 1930s by Khinchin and Kolmogorov!
What about the current discussions on the replication crisis?
The claim in Amrhein, Trafimow and Greenland (2019), “Inferential statistics as descriptive statistics: there is no replication crisis if we don’t expect replication”, is ill-conceived.
I The validity of the same probabilistic assumptions that underwrite the reliability of inferences also ensures the ‘pertinence’ of descriptive statistics.
[Figure: Case 1 — t-plot of IID data y0]
[Figure: Case 2 (ID false) — t-plot of data x0]
Case 1: Consistent (valid)
ȳ = (1/n)∑_{t=1}^n y_t = 2.03  [true: E(Y_t)=2]
s²_y = (1/n)∑_{t=1}^n (y_t − ȳ)² = 1.01  [true: Var(Y_t)=1]

Case 2: Inconsistent (spurious)
x̄ = (1/n)∑_{t=1}^n x_t = 12.1  [true: E(X_t) changes with t (trending mean)]
s²_x = (1/n)∑_{t=1}^n (x_t − x̄)² = 34.21  [true: Var(X_t)=1]
Consider case 3, where the Independence assumption is invalid.
[Figures: Case 3 (I false) — t-plot of data z0; histogram of data z0]
I When the IID assumptions are invalid for x0, not only the descriptive statistics but also the chosen estimated frequency curve f(x; θ̂(x0)) will be highly misleading.
3 Fisher’s Model-based frequentist approach
Fisher (1922) recast Pearson’s curve-fitting into modern model-based statistical induction by viewing the data x0 as a ‘typical realization’ of a parametric
statistical model, generically defined by:
Mθ(x) = {f(x; θ), θ∈Θ}, x∈R^n_X, for Θ⊂R^m, m < n.   (3.0.3)
Example. The simple Normal model is specified by:
X_t ∼ NIID(μ, σ²), θ:=(μ, σ²)∈Θ:=(R×R+), x_t∈R, t∈N.   (3.0.4)
Mθ(x) is framed in terms of probabilistic assumptions from 3 broad categories:
(D) Distribution: Normal, Beta, Gamma, Bernoulli, ...
(M) Dependence: Independence, Correlation, Markov, Martingale, ...
(H) Heterogeneity: Identically Distributed, Strict Stationarity, Weak Stationarity, Separable heterogeneity, ...
assigned to the stochastic process {X_t, t∈N} underlying data x0.
These assumptions determine the joint distribution f(x; θ), x∈R^n_X, of the sample X:=(X1, ..., Xn), including its parametrization θ∈Θ, as well as the likelihood function L(θ; x0) ∝ f(x0; θ), θ∈Θ; see Spanos (1986).
Fisher proposed a complete reformulation of statistical induction by modeling the statistical Generating Mechanism (GM) [Mθ(x)], framed in terms of the observable stochastic process {X_t, t∈N} underlying data x0.
Fisher (1922) asserts that Mθ(x) is chosen by responding to the question: “Of what population is this a random sample?” (p. 313), adding that “the adequacy of our choice may be tested a posteriori” (p. 314). The ‘adequacy’ can be evaluated using Mis-Specification (M-S) testing; see Spanos (2006).
That is, Mθ(x) is selected to account for the chance regularities in data x0, but its appropriateness is evaluated by M-S testing.
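A minimal sketch of what simple M-S checks look like for the simple Normal model (illustrative only, not the comprehensive battery of Spanos, 2006): standardized skewness and excess kurtosis probe the Normality assumption, and a runs test on the signs of deviations from the mean probes Independence.

```python
import math
import random

random.seed(3)
x = [random.gauss(2.0, 1.0) for _ in range(200)]  # data consistent with the model
n = len(x)
m = sum(x) / n
s = math.sqrt(sum((v - m) ** 2 for v in x) / n)

# (i) Normality check: standardized skewness and excess kurtosis
# (both should be near 0 for Normal data).
skew = sum(((v - m) / s) ** 3 for v in x) / n
kurt = sum(((v - m) / s) ** 4 for v in x) / n - 3.0

# (ii) Independence check: runs test on signs of deviations from the mean.
signs = [v > m for v in x]
runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
n1 = sum(signs)
n2 = n - n1
exp_runs = 1 + 2 * n1 * n2 / n
var_runs = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
z_runs = (runs - exp_runs) / math.sqrt(var_runs)  # approx N(0,1) under Independence
```

Values of |z_runs| or of the standardized moments far from 0 would count as evidence of departures from the model’s assumptions.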
The primary objective of frequentist inference is to use the sample information, as summarized by f(x; θ), x∈R^n_X, in conjunction with data x0, to learn from data about θ*, the true value of θ∈Θ; shorthand for saying that Mθ*(x)={f(x; θ*)}, x∈R^n_X, could have generated data x0.
Learning from x0 takes the form of ‘statistical approximations’ around θ*, framed in terms of the sampling distribution, F(t; θ), ∀t∈R, of a statistic (estimator, test, predictor) T=g(X1, ..., Xn), derived using two different forms of reasoning via:
F(t) = P(T ≤ t) = ∫∫···∫_{x: g(x)≤t} f(x; θ)dx, ∀t∈R.   (3.0.5)
(i) Factual (estimation and prediction): presuming that θ=θ*∈Θ, and
(ii) Hypothetical (hypothesis testing): various hypothetical scenarios based on different prespecified values of θ, under H0: θ∈Θ0 (presuming that θ∈Θ0) and H1: θ∈Θ1 (presuming that θ∈Θ1), where Θ0 and Θ1 partition Θ.
I Crucially important: (i) the statistical adequacy of Mθ(x) ensures that θ* lies within Mθ(x), and thus learning from data x0 is attainable.
(ii) Neither form of frequentist reasoning (factual or hypothetical) involves conditioning on θ, an unknown constant.
(iii) The decision-theoretic reasoning, for all values of θ in Θ (∀θ∈Θ), undermines learning from data about θ*, and gives rise to Stein-type paradoxes and admissibility fallacies; Spanos (2017).
I Misspecification. When Mθ(x) is misspecified, f(x; θ) is incorrect, and this distorts F(t; θ), often inducing inconsistency in estimators and sizeable discrepancies between the actual and nominal error probabilities in Confidence Intervals (CIs), testing and prediction. This is why Akaike-type model selection procedures often go astray, since all goodness-of-fit/prediction measures presume the validity of Mθ(x); Spanos (2010).
How can one apply Fisher’s model-based statistics when the empirical modeling begins with a substantive model Mϕ(x)?
[i] Bring out the statistical model Mθ(x) implicit in Mϕ(x); there is always one that comprises solely the probabilistic assumptions imposed on data x0! It is defined by an unrestricted parametrization that follows from the probabilistic assumptions imposed on the process {X_t, t∈N} underlying x0, and it includes Mϕ(x) as a special case.
[ii] Relate the substantive parameters ϕ to θ via restrictions, say g(ϕ, θ)=0, ensuring that these restrictions define ϕ uniquely in terms of θ.
Example. For the substantive model known as the Capital Asset Pricing:
Mϕ(z): (y_t − x_{2t}) = a1·(x_{1t} − x_{2t}) + ε_t, (ε_t|X_t=x_t) ∼ NIID(0, σ²_ε), t∈N
Mθ(z): y_t = β0 + β1·x_{1t} + β2·x_{2t} + u_t, (u_t|X_t=x_t) ∼ NIID(0, σ²_u), t∈N
g(ϕ, θ)=0: β0=0, β1+β2−1=0, where ϕ:=(a1, σ²_ε), θ:=(β0, β1, β2, σ²_u)
[iii] Test the validity of H0: g(ϕ, θ)=0 vs. H1: g(ϕ, θ)≠0 to establish whether the substantive model Mϕ(z) belies data z0.
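A simplified single-regressor analogue of step [iii] in Python (hypothetical data, not the CAPM example): the substantive restriction fixes (β0, β1)=(0, 1), and an F-type statistic compares the restricted and unrestricted residual sums of squares.

```python
import random

random.seed(5)
n = 100
x = [random.gauss(0.0, 1.0) for _ in range(n)]
# Data generated to satisfy the substantive restriction: y_t = x_t + u_t.
y = [xi + random.gauss(0.0, 0.5) for xi in x]

# Unrestricted statistical model: y_t = b0 + b1*x_t + u_t, fitted by OLS.
xbar = sum(x) / n
ybar = sum(y) / n
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar
rss_u = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))

# Restricted (substantive) model imposes g(phi, theta)=0: b0=0, b1=1.
rss_r = sum((b - a) ** 2 for a, b in zip(x, y))

# F statistic for H0: g(phi, theta)=0 (2 restrictions, n-2 residual df).
F = ((rss_r - rss_u) / 2) / (rss_u / (n - 2))
```

A large F relative to the F(2, n−2) distribution would indicate that the substantive information belies the data; note the comparison is only meaningful if the unrestricted statistical model is itself statistically adequate.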
Main features of the Fisher model-based approach
[a] Inductive premises: Mθ(x) comprises a set of complete, internally consistent, and testable probabilistic assumptions from the Distribution, Dependence and Heterogeneity categories, relating to the observable process {X_t, t∈N} underlying data x0.
Mθ(x) is viewed as a statistical stochastic mechanism aiming to account for all
the chance regularity patterns in data x0.
[b] Model choice: the appropriate Mθ(x) is chosen on statistical adequacy grounds using comprehensive Mis-Specification (M-S) testing to ensure that inferences are reliable: the actual ≈ nominal error probabilities.
[c] Inductive inference: the interpretation of probability is frequentist, and the underlying inductive reasoning is either factual (estimation, prediction) or hypothetical (testing), relating to learning from data about θ*.
The effectiveness (optimality) of inference procedures is calibrated using
error probabilities based on a statistically adequate Mθ(x)!
Regrettably, the replication crisis literature often confuses hypothetical reasoning
with conditioning on 0!
Diaconis and Skyrms (2018) claim (tongue-in-cheek) that p-value testers conflate P(H0|x0) with P(x0|H0): “The untutored think they are getting the probability of effectiveness [of a drug] given the data, while they are being given conditional probabilities going in the opposite direction.” (p. 67)
I The ‘untutored’ know from basic probability theory that conditioning on H0: θ=θ0 is formally illicit since θ is neither an event nor a random variable!
[d] Substantive vs. statistical: the substantive model Mϕ(x) is embedded into a statistically adequate Mθ(x) via restrictions g(θ, ϕ)=0, θ∈Θ, ϕ∈Φ, whose rejection indicates that the substantive information in Mϕ(x) belies x0!
I The above modeling strategy [i]-[iii] can be used to provide sound statistical foundations for Graphical Causal Modeling, which revolves around substantive causal models [Mϕ(x)]. It will enable a harmonious blending of the statistical with the substantive information without undermining the credibility of either, and allow for probing the validity of causal information.
4 Graphical Causal (GC) Modeling
Quantifying a Graphical Causal (GC) model based on directed acyclic graphs (DAGs) (Pearl, 2009; Spirtes et al., 2000) constitutes another form of curve-fitting an a priori postulated substantive model Mϕ(z).
A crucial weakness of GC modeling is that the causal (substantive) information is usually treated as established knowledge instead of best-guess conjectures whose soundness needs to be tested against data Z0.
I Foisting a DAG substantive model Mϕ(z) on data Z0 will usually yield a statistically and substantively misspecified model!
This is because the estimation of Mϕ(z) invokes a set of probabilistic assumptions relating to the observable process {Z_t, t∈N} underlying data Z0, i.e. the implicit statistical model Mθ(z), whose adequacy is unknown!
I Can one guard against statistical and substantive misspecification?
Embed the DAG model into the Fisher model-based framework
Step 1. Unveil the statistical model Mθ(z) implicit in the GC model.
Step 2. Establish the statistical adequacy of Mθ(z) using comprehensive M-S
testing, and respecification when needed.
Substantive (GC) model [a] z - confounder:
y_t = a0 + a1·x_t + a2·z_t + ε1t;  x_t = b0 + b1·z_t + ε2t, t∈N
Statistical model for [a]:
y_t = γ01 + γ11·z_t + u1t;  x_t = γ02 + γ12·z_t + u2t

Substantive (GC) model [b] m - mediator:
y_t = a0 + a1·x_t + a2·m_t + ε1t;  m_t = b0 + b1·x_t + ε3t, t∈N
Statistical model for [b]:
y_t = γ01 + γ11·x_t + u3t;  m_t = γ02 + γ12·x_t + u4t

Substantive (GC) model [c] c - collider:
y_t = a0 + a1·x_t + ε4t, t∈N;  c_t = b0 + b1·x_t + b2·y_t + ε5t
Statistical model for [c]:
y_t = γ01 + γ11·x_t + u3t;  c_t = γ02 + γ12·x_t + u4t

Diagram 2: Functional Graphical Causal models
Step 3. Use a statistically adequate Mθ(z) to address the identification and estimation of the structural parameters ϕ∈Φ.
Step 4. Test the validity of the overidentifying restrictions stemming from g(θ, ϕ)=0, θ∈Θ, ϕ∈Φ.
Excellent goodness-of-fit/prediction is relevant for substantive adequacy, which can only be probed after:
(i) establishing the statistical adequacy of Mθ(z), and
(ii) evaluating the validity of the restrictions: H0: g(θ, ϕ)=0 vs. H1: g(θ, ϕ)≠0.
Rejecting H0 indicates that the substantive information in Mϕ(z) belies data z0!
5 Nonparametric statistics & curve-fitting
Nonparametric statistics began in the 1970s, extending Kolmogorov (1933a): when the sample X:=(X1, X2, ..., Xn) is IID, the empirical cdf F̂_n(x) is a good estimator of the cdf F(x) for n large enough.
Attempts to find good estimators for the density function f(x), x∈R, led to:
(a) kernel smoothing and related techniques, including regression-type models,
(b) series estimators of f̂(x) = ∑_{k=0}^K θ̂_k·φ_k(x), where {φ_k(x), k=1, 2, ...} are polynomials, usually orthogonal; see Wasserman (2006).
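For concreteness, a minimal Gaussian kernel density estimator of type (a), in Python (illustrative; the rule-of-thumb bandwidth is itself a smoothness assumption of exactly the kind discussed below):

```python
import math
import random

random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
n = len(data)

def K(u):
    # Gaussian kernel: standard Normal density.
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

# Silverman's rule-of-thumb bandwidth (implicitly assumes near-Normal data).
m = sum(data) / n
s = math.sqrt(sum((x - m) ** 2 for x in data) / n)
h = 1.06 * s * n ** (-1 / 5)

def f_hat(x):
    # Kernel density estimate: average of kernels centered at the data points.
    return sum(K((x - xi) / h) for xi in data) / (n * h)
```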
A nonparametric statistical model is specified in terms of a broader family F of distributions (Wasserman, 2006):
MF(x) = {f(x; ψ), f∈F, ψ∈Ψ, x∈R^n_X},
where F is defined in terms of indirect & non-testable Distribution assumptions, such as:
(a) the existence of moments up to order r ≥ 1 (see Bahadur and Savage, 1956, on such assumptions),
(b) smoothness restrictions on the unknown density function f(x), x∈R (symmetry, differentiability, unimodality of f(x), etc.).
Dickhaus (2018), p. 13: “Of course, the advantage of considering F is that the
issue of model misspecification, which is often problematic in parametric models, is
avoided.” Really?
Nonparametric models always impose Dependence and Heterogeneity as-
sumptions (often Independent and Identically Distributed (IID))!
What are the consequences of replacing f(x; θ) with f(x; ψ), f∈F?
Likelihood-based inference procedures are replaced by loss-function-based procedures driven by mathematical approximations and goodness-of-fit measures, relying on asymptotic inference results, at a high price in the reliability and precision of inference, since (i) the adequacy of f(x; ψ), f∈F, is impossible to establish, and (ii) the ‘indirect’ and non-testable distribution assumptions invariably contribute substantially to the imprecision/unreliability of inference.
I As argued by Le Cam (1986), p. xiv:
“... limit theorems “as n tends to infinity” are logically devoid of content about what happens at any particular n. All they can do is suggest certain approaches whose performance must then be checked on the case at hand. Unfortunately, the approximation bounds we could get are too often too crude and cumbersome to be of any practical use.”
6 Data Science (Machine Learning and all that!)
Big Data and Data Science include Machine Learning (ML), Statistical Learning Theory (SLT), pattern recognition, data mining, etc.
As claimed by Vapnik (2000): “Between 1960 and 1980 a revolution in sta-
tistics occurred: Fisher’s paradigm, ... was replaced by a new one. This para-
digm reflects a new answer to the fundamental question: What must one know
a priori about an unknown functional dependency in order to estimate it on the
basis of observations? In Fisher’s paradigm the answer was very restrictive — one
must know almost everything. Namely, one must know the desired dependency up
to the values of a finite number of parameters. ... In the new paradigm ... it
is sufficient to know some general properties of the set of functions to which the
unknown dependency belongs.” (ix).
In Fisher’s model-based approach one selects Mθ(z) to account for the chance regularities in data Z0, and evaluates its validity before any inferences are drawn, respecifying it when Mθ(z) is misspecified. The form of dependence follows from the probabilistic assumptions of a statistically adequate Mθ(z).
Example. Assuming that {X_t, t∈N} is Normal, Markov and stationary, the dependence takes the form of an AR(1) model:
X_t = a0 + a1·X_{t−1} + u_t, (u_t|σ(X_{t−1})) ∼ NIID(0, σ²), t∈N.
Machine Learning views statistical modeling as an optimization problem relating to how a machine can ‘learn from data’:
(a) learner’s input: a domain set X, a label set Y,
(b) training data from X × Y: z_t:=(x_t, y_t), t=1, 2, ..., n,
(c) with an unknown distribution f*(z), and
(d) learner’s output: h(x): X → Y.
The learning algorithm is all about choosing h(x) to approximate ‘closely’ the true relationship y=g(x) by minimizing the distance ‖g(x) − h(x)‖.
Barriers to entry? The underlying probabilistic and mathematical
framework comes in the form of functional analysis: the study of infinite-
dimensional vector (linear) spaces endowed with a topology (metric, norm,
inner product) and a probability measure.
Example. The normed linear space (C[a, b], ‖·‖_p) of all real-valued continuous functions f(x) defined on [a, b]⊂R with the L_p-norm (Murphy, 2022):
‖f‖_p = (∫_a^b |f(x)|^p dx)^{1/p}, or ‖x‖_p = (∑_{k=0}^n |x_k|^p)^{1/p}, p=1, 2, ..., ∞.   (6.0.6)
The mathematical approximation problem is transformed into an optimization problem in the context of a vector space, employing powerful theorems such as the open mapping, the Banach-Steinhaus and the Hahn-Banach theorems; see Carlier (2022).
To ensure the existence and uniqueness of the optimization solution, the approximation problem is often embedded in a complete inner product vector (linear) space (C[a, b], ‖·‖_2) of real or complex-valued functions f(X) defined on [a, b]⊂R with the L2-norm, also known as a Hilbert space of square-integrable functions (E(|X|²) < ∞), where {X_t, t∈N} is a stochastic process. A Hilbert space generalizes n-dimensional Euclidean geometry to an infinite-dimensional inner product space that allows lengths and angles to be defined, rendering optimization possible.
Supervised learning (Regression). A typical example is a regression model:
y = h(x; ψ) + ε,
where h(x; ψ)∈G, a family of smooth enough mathematical functions, is approximated using data Z0:={(x_t, y_t), t=1, 2, ..., n}.
Risk functions. The problem is framed in terms of a loss function L(y, h(x; ψ)), ∀ψ∈Ψ, ∀z∈R^n_Z:
(a) ‖·‖_2: L(y, h(x; ψ)) = (y − h(x; ψ))²,  (b) ‖·‖_1: L(y, h(x; ψ)) = |y − h(x; ψ)|.
To render L(y, h(x; ψ)) a function of ψ∈Ψ only, z∈R^n_Z is eliminated by taking expectations with respect to the distribution of the sample Z, f*(z) (presumed unknown), to define a risk function:
R(f*, h(z; ψ)) = E_z(L(y, h(x; ψ))) = ∫_z L(y, h(x; ψ))·f*(z) dz, ∀ψ∈Ψ.
The statistical model implicit in Data Science is: MF(z) = {f*(z), f*∈F(ψ), ψ∈Ψ, z∈R^n_Z}, and the ensuing inference revolves around the risk function using decision-theoretic reasoning based on ∀ψ∈Ψ.
Hence, Data Science tosses away all forms of frequentist inference apart from point estimation. Since f* is unobservable, R(f*, h(z; ψ)) is estimated using basic descriptive statistics, i.e. the sample average (1/n)∑_{t=1}^n (·)!
Assuming that {Z_t, t∈N} is IID (often not stated explicitly!), one can use the arithmetic average R̂(f*, h(z; ψ)) = (1/n)∑_{t=1}^n L(y_t, h(x_t; ψ)), and then minimize it to yield a consistent estimator of h(x; ψ):
ĥ(z; ψ̂(x)) = arg min_{h∈G} [(1/n)∑_{t=1}^n L(y_t, h(x_t; ψ))],   (6.0.7)
where ŷ = ĥ(z; ψ̂(x)) minimizing (6.0.7) is inherently overparametrized!
Regularization. Depending on the dimension (effective number of parameters) of the class of functions G, (6.0.7) will usually give rise to serious overfitting — near-interpolation! To reduce the inherent overparametrization problem, ML methods impose ad hoc restrictions on the parameters, euphemistically known as regularization, deriving ŷ = ĥ(z; φ̂(x)) via minimizing:
R_λ(f*, h(z; ψ)) = R̂(f*, h(z; ψ)) + λ·C(h(z; ψ)),   (6.0.8)
where C(h(z; ψ)) is often related to the algorithmic complexity of the class G.
where ((z; ψ)) is often related to the algorithmic complexity of the class G.
Example. For the LR model y_t = β⊤x_t + u_t, β is estimated by minimizing:
∑_{t=1}^n (y_t − β⊤x_t)²  [least-squares]  +  λ1·∑_{i=1}^m |β_i| + λ2·∑_{i=1}^m β_i²  [regularization term].
The idea is that regularization reduces the variance of β̂ with only small increases in its bias, which improves the prediction MSE ∑_{t=n+1}^{n+m} (y_t − β̂⊤x_t)² (artificially!), at the expense of learning from data; Spanos (2017).
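For a single regressor without an intercept, the ridge (λ2) part of the penalty has a closed form that makes the variance-for-bias trade visible (the numbers are illustrative assumptions):

```python
import random

random.seed(8)
n = 50
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.5 * xi + random.gauss(0.0, 1.0) for xi in x]  # assumed slope of 1.5

sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)

b_ols = sxy / sxx            # least-squares slope (no intercept)
lam = 25.0                   # ridge penalty weight (lambda_2)
b_ridge = sxy / (sxx + lam)  # ridge slope: shrunk toward zero (biased, lower variance)
```

The shrinkage is mechanical: the penalty inflates the denominator, so the ridge estimate is always smaller in magnitude than the OLS one, whatever the data say.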
Probably Approximately Correct (PAC) learnability refers to ‘learning’ (computationally) about f*(z) using a polynomial-time algorithm to choose h(z; ψ) in G, in the form of an upper bound (Murphy, 2022):
P(max_{h∈G} |R̂(f*, h(z; ψ)) − R(f*, h(z; ψ))| > ε) ≤ 2·dim(G)·exp(−2nε²).
Statistical inference in Data Science is invariably based on asymptotic results (‘as n → ∞’) invoking IID, such as the Uniform Law of Large Numbers (ULLN) for the whole family of functions G, as well as more general asymptotic results derived by invoking non-testable mathematical and probabilistic assumptions, such as mixing conditions (Dependence) and asymptotic homogeneity (Heterogeneity)!
Main features of the Data Science approach:
[a] Inductive premises: normed linear spaces endowed with a probability measure relating to the stochastic process {Z_t, t∈N}. The probabilistic assumptions underlying {Z_t, t∈N} are often IID, with an indirect distribution assumption relating to a family of mathematical functions G, chosen on mathematical approximation grounds.
[b] Model choice: the best fitted curve ŷ = g(x; φ̂(x)) in G is chosen on goodness-of-fit/prediction grounds and/or Akaike-type information criteria.
[c] Inductive inference: the interpretation of probability can be both frequentist and Bayesian, but the underlying reasoning is decision-theoretic (∀ψ∈Ψ), which is at odds with frequentist inference. The optimality of inference procedures is based on the risk function, framed in terms of asymptotic theorems that invoke non-testable mathematical and probabilistic assumptions. The ‘best’ fitted curve ŷ = g(x; φ̂(x)) in G is used for ‘predictive learning’.
[d] Substantive vs. statistical: the fitted curve ŷ = g(x; φ̂) in G is rendered a black box free of any statistical/substantive interpretation, since the curve-fitting and regularization impose arbitrary restrictions on ψ to fine-tune the prediction error. This obviates any possibility of interpreting φ̂(x) or establishing any evidence for potential causal claims, etc.
Weaknesses of Data Science (ML, SLT, etc.) algorithmic methods
1. The Curve-Fitting Curse: viewing the modeling facet with data as an optimization problem in the context of a Hilbert space of overparametrized functions will always guarantee a unique solution on goodness-of-fit/prediction grounds, trustworthiness be damned.
I Minimizing ∑_{t=n+1}^N (y_t − ĥ(x_t; ψ))² using additional data in the testing and validation facets, z_t:=(x_t, y_t), t=n+1, n+2, ..., N, has no added value in learning from data when the training-based choice ŷ = ĥ(x; ψ) is statistically misspecified. It just adds more scope for tweaking!
2. Is (θ)= − ln (θ; Z0) θ∈Θ just another loss function (Cherkassky and
Mulier, 2007, p. 31)? No! ln (θ; Z0) is based on testable probabilistic
assumptions comprising Mθ(z) as opposed to arbitrary loss functions
based on information other than data Z0 (Spanos, 2017).
3. Mathematical approximation error terms are very different from white-noise statistical error terms. The former rely on Jackson-type upper bounds and are never statistically non-systematic. Hence, conflating the two errors imperils the trustworthiness of evidence; Spanos (2010).
4. What patterns? Curve-fitting using mathematical approximation patterns is very different from accounting for the recurring chance regularity patterns in data Z0, which relate directly to probabilistic assumptions. Indeed, the search for approximation patterns often invokes the validity of certain probabilistic assumptions. For instance, supervised and unsupervised learning using scatterplots invokes IID for Z0; Wilmot (2019), p. 66.
Example. For data Z0:={(x_t, y_t), t=1, 2, ..., n} the scatterplot presupposes IID! Unfortunately, the IID assumptions are false for both data series, which exhibit trending means (non-ID) and irregular cycles (non-I).
5. How reliable are Data Science inferences? The training/testing/validation split of the data can improve the selected models on prediction grounds, but will not secure the reliability of inference.
6. In contrast to PAC learnability, which takes the fitted ŷ = g(x; φ̂(x))∈G at face value to learn about f*(z), the learning in Fisher’s model-based statistics stems from f(z; θ), z∈R^n_Z, to h(x; ψ(θ)) = E(Y|X=x), where the probabilistic structure of f(z; θ) determines y = h(x; ψ(θ)) as well as ψ(θ).
7. The impression in Data Science that the combination of (i) a very large sample size n for data Z0, (ii) the training/testing/validation split, and (iii) asymptotic inference renders the statistical adequacy problem irrelevant, is an illusion! Departures from IID will render both the reliability and the precision of inference worse and worse as n increases (Spanos & McGuirk, 2001). Moreover, invoking limit theorems ‘as n → ∞’ based on non-testable Dependence and Heterogeneity assumptions is another head game.
On a positive note, ML can be useful when: (i) the data Z0 is (luckily) IID, (ii) Z_t includes a large number of variables, (iii) one has meager substantive information, and (iv) the sole objective is a short-horizon ‘makeshift’ prediction.
7 Summary and conclusions
7.1 ‘Learning from data’ about phenomena of interest
Breiman’s (2001) claim that in Fisher’s paradigm “One assumes that the data are generated by a given stochastic data model” refers to a common erroneous implementation of model-based statistics, where Mθ(z) is viewed as an a priori postulated model — presumed to be valid no matter what; see Spanos (1986).
In fact, Fisher (1922), p. 314, emphasized the crucial importance of model validation: “For empirical as the specification of the hypothetical population [Mθ(z)] may be, this empiricism is cleared of its dangers if we can apply a rigorous and objective test of the adequacy with which the proposed population [Mθ(z)] represents the whole of the available facts.”, i.e. Mθ(z) accounts for all the chance regularities in Z0.
Fisher’s parametric model-based [Mθ(z)] statistics, relying on strong (not weak) probabilistic assumptions that are validated vis-a-vis data Z0, provides the best way to learn from data, using ‘statistical approximations’ around θ* framed in terms of sampling distributions of ‘statistics’, because these secure the effectiveness (reliability and precision) of inference and the trustworthiness of the ensuing evidence.
The Data Science algorithmic and the Graphical Causal (GC) modeling approaches share an inbuilt proclivity to side-step the statistical misspecification problem. The obvious way to improve the trustworthiness of their evidence is to integrate them within a broad Fisher model-based statistical framework.
In turn, sophisticated algorithms can enhance the model-based approach in several ways, including more thorough M-S testing.
That, of course, would take a generation to be implemented mainly due to the
pronounced differences in culture and terminology!
In the meantime, the trustworthiness of evidence in Data Science can be ameliorated using simple M-S testing to evaluate the non-systematicity of the residuals from the fitted curve: ŷ_t = h(x_t; ψ̂), û_t = y_t − ŷ_t, t=1, 2, ..., n.
It is important to emphasize that ‘excellent’ statistical prediction is NOT just small prediction errors relative to a loss function, but non-systematic and ‘small’ prediction errors relative to likelihood-based goodness-of-prediction measures; see Spanos (2007).
"All models are wrong, but some are useful!" NO statistically misspecified model
is useful for ‘learning from data’ about phenomena of interest!
7.2 Potential casualties of the STATISTICS WARS
(1) Frequentist inference in general, and hypothesis testing in particular, as well as the underlying frequentist reasoning: factual and hypothetical.
(2) Error probabilities and their key role in securing the trustworthiness of evidence by controlling & evaluating how severely tested claims are, including:
(a) Statistical adequacy: does Mθ(z) account for the chance regularities
in data Z0?
(b) Substantive adequacy: does the model Mϕ(z) shed adequate light
on (describes, explains, predicts) the phenomenon of interest?
(3) Mis-Specification (M-S) testing and respecification to account for
the chance regularity patterns exhibited by data Z0, and ensure that the
substantive information does not belie the data.
(4) Learning from data about phenomena of interest. Minimizing a risk
function to reduce the overall Mean Square Prediction Error (MSPE) '∀ψ∈Ψ'
undermines learning from Z0 about ψ∗; Spanos (2017).
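Point (4) can be illustrated with a toy numerical sketch (a ridge-type shrinkage estimator with an arbitrary hypothetical penalty LAM; the setup is mine, not an example from the slides): minimizing a penalized risk defined over all ψ∈Ψ systematically shrinks the estimate away from the true value ψ∗, while the unpenalized estimator centers on it.

```python
import random

# Illustration: shrinkage aimed at overall risk biases the estimate
# away from the true parameter, trading learning for prediction risk.
random.seed(1)
TRUE_BETA = 2.0   # the true psi* in this toy model y = beta*x + error
LAM = 5.0         # hypothetical ridge penalty (arbitrary choice)

def one_sample(n=20, sigma=1.0):
    """Simulate y_t = beta*x_t + e_t and return (OLS, ridge) estimates."""
    x = [random.uniform(-1, 1) for _ in range(n)]
    y = [TRUE_BETA * xi + random.gauss(0, sigma) for xi in x]
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx, sxy / (sxx + LAM)  # OLS vs penalized (ridge) estimator

ols, ridge = zip(*(one_sample() for _ in range(5000)))
mean_ols = sum(ols) / 5000
mean_ridge = sum(ridge) / 5000
print(round(mean_ols, 2), round(mean_ridge, 2))
# OLS centers on the true beta; ridge is systematically shrunk below it
```

Averaged over many samples, the unpenalized estimator tracks the true coefficient while the penalized one is biased toward zero, which is the sense in which risk-based reasoning '∀ψ∈Ψ' can undermine learning about ψ∗.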
Thanks for listening!
31

Revisiting the Two Cultures in Statistical Modeling and Inference as they relate to the Statistics Wars and Their Potential Casualties

  • 1. Revisiting the Two Cultures in Statistical Modeling and Inference: the Statistics Wars and Their Potential Casualties Aris Spanos [Virginia Tech, USA] 1. Introduction Paradigm shifts in statistics during the 20th century 2. Karl Pearson’s descriptive statistics (1894-1920s) The original curve-fitting 3. Fisher’s model-based statistical induction (1922) Securing statistical adequacy and the trustworthiness of evidence 4. *Graphical Causal modeling (1990s) Curve-fitting substantive models 5. *The nonparametric turn for model-based statistics (1970s) Replacing ‘distribution’ assumptions with non-testable assumptions 6. Data Science (Machine Learning and all that!) (1990s) Curve-fitting using algorithmic searches 7. Summary and Conclusions Potential casualties of the statistics wars 1
  • 2. 1 Introduction Breiman (2001): “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.” During the 20th century statistical modeling and inference experienced several paradigm shifts, the most notable being: Karl Pearson’s descriptive statistics (— 1920s), Fisher’s model-based statistics (1920s), Nonparametric statistics (1970s), Graphical Causal modeling (1990s), and Data Science (Machine Learning, Statistical Learning Theory, etc.) (1990s). Key points argued in the discussion that follows • The discussions on non-replication overlook a key contributor to un- trustworthy evidence, statistical misspecification: invalid probabilistic assumptions imposed (explicitly or implicitly) on one’s data x0:=(1  ). • There is a direct connection between Karl Pearson’s descriptive statistics, Nonparametric statistics and the Data Science curve-fitting. 2
  • 3. • All three approaches rely on (i) curve-fitting, (ii) goodness-of-fit/ prediction measures, and (iii) asymptotic inference results (as  → ∞) based on non-testable probabilistic/mathematical assumptions. • The Curve-Fitting Curse: when empirical modeling relies on curve-fitting of mathematical functions with a sufficiently large number of parame- ters to fine-tune (e.g. neural networks, orthogonal polynomials), one will always find a ‘best’ model on goodness-of-fit/prediction grounds, even if that model is totally false. Worse, one will be oblivious to the fact that such a ‘best’ model will commonly yield untrustworthy evidence! • ‘Best’ goodness-of-fit/prediction, i.e. ‘small’ residuals/prediction errors relative to a particular loss function, is neither necessary nor sufficient for trustworthy evidence! What ensures the latter is the statistical adequacy (approximate validity) of the the invoked statistical model Mθ(x) compris- ing the probabilistic assumptions imposed one one’s data Spanos (2007). • Trustworthy evidence stems fromprocedures whose actual error probabilities approximate ‘closely’ the nominal ones — derived by presuming the va- lidity of Mθ(x) That is, the trustworthiness of evidence originates in the relevant error probabilities as they relate to the severity principle. 3
  • 4. All approaches to statistics require three basic elements: (i) substantive questions of interest—however vague or highly specific, (ii) appropriate data x0 to shed light on these questions (learn from x0), (iii) probabilistic assumptions imposed (implicitly or explicitly) on the observable process { ∈N} underlying data x0 These are the assumptions that matter for statistical inference purposes, and NOT those of any error terms. Key differences of alternative approaches to statistics [a] Inductive premises: their framing of the inductive premises (probabilistic assumptions imposed on the data), and the interpretation of the selected model. [b] Model choice: the selection of the ‘best’ model for the particular data. [c] Inductive inference: the underlying inductive reasoning and the nature and interpretation of their inferential claims. [d] Substantive vs. statistical information/model: how they conciliate the substantive (theory-based) and statistical (data-based) information. 4
  • 5. 2 Karl Pearson’s descriptive statistics Data x0:=(1  ) =⇒ Histogram of the data =⇒ Fitting (;b 1b 2b 3b 4) Diagram 1: Karl Pearson’s approach to statistics One begins with the raw data x0:=(1  ), whose initial ‘rough summary’ takes the form of a histogram with  ≥ 10 bins. To provide a more succinct descriptive summary of the histogram Pearson would use the first four raw mo- ments of x0 to select a frequency curve within a particular family known today as the Pearson family (F ). Members of this family are generated by: F :  ln (;ψ)  =[(−1) ¡ 2+3+42 ¢ ] θ∈Θ⊂R4  ∈R:=(−∞ ∞) (2.0.1) that includes several well-known distributions. F is characterized by the four unknown parameters :=(1 2 3 4) that are estimated using b (x0)=1  P =1   , =1 2 3 4 yielding b θ(x0):=(b 1b 2b 3b 4). b θ(x0) is used to select 0()∈F 5
  • 6. based on the estimated curve b (; b θ(x0)) that ‘best’ fits the histogram using Pearson’s (1900) goodness-of-fit test: (X)= X =1 [( b −)2 ] v →∞ 2 () (2.0.2) What Pearson and his contemporaries did not appreciate sufficiently is that, ir- respective of whether one is summarizing the data for descriptive or infer- ential purposes, one implicitly imposes probabilistic assumptions on the data. For instance, the move from the raw data x0 to a histogram invokes a ‘random’ (IID) sample X:=(1  ) underlying data x0, and so do the formulae: =1  P =1  b 2 =1  P =1(−)2  =1  P =1  b 2 =1  P =1(−)2  b = h ( P =1(−)(−))  p [ P =1(−)2] [ P =1(−)2] i  when estimating ()  () ( ) etc.; see Yule (1926). I Charging Karl Pearson with ignorance will be anachronistic since the the- ory of stochastic processes needed to understand the concept of non-IID samples was framed in the late 1920s early 1930s; Khitchin and Kolmogorov! What about the current discussions on the replication crisis? 6
  • 7. Amrhein, Trafimow, Greenland (2019) “Inferential statistics as descriptive statis- tics: there is no replication crisis if we don’t expect replication”, is ill-conceived. I The validity of the same probabilistic assumptions that underwrite the reliability of inferences also ensure the ‘pertinence’ of descriptive statistics. 100 9 0 8 0 7 0 6 0 5 0 4 0 30 2 0 10 1 5 4 3 2 1 0 In de x y Case 1: t-plot of IID data y0 10 0 9 0 80 7 0 6 0 5 0 4 0 3 0 2 0 10 1 25 20 15 10 5 0 Ind e x x Case 2 (ID false): t-plot of data x0 Case 1: Consistent (valid) =1  P =1 =203 true z }| { [()=2] 2 =1  P =1(−)2 =101 true z }| { [ ()=1] Case 2: Inconsistent (spurious) =1  P =1 =121 true z }| { [()=2−2] 2 =1  P =1(−)2 =3421 true z }| { [ ()=1] 7
  • 8. Consider case 3 where the Independence assumption is invalid. Case 3 (I false): t-plot of data z0 Case 3: Histogram of data z0 I When the IID assumptions are invalid for x0, not only the descriptive statistics, but also the estimated frequency curve chosen (; b θ(x0)) will be highly misleading. 8
  • 9. 3 Fisher’s Model-based frequentist approach Fisher (1922) recast Pearson’s curve-fitting into modern model-based statisti- cal induction by viewing the data x0 as a ‘typical realization’ of a parametric statistical model, generically defined by: Mθ(x)={(x; θ) θ∈Θ} x∈R  for Θ⊂R   (3.0.3) Example. The simple Normal model is specified by: vNIID( 2 ) θ:=( 2 )∈Θ:= (R×R+)  ∈R ∈N} (3.0.4) Mθ(x) is framed in terms of probabilistic assumptions from 3 broad categories: (D) Distribution Normal Beta Gamma Bernoulli . . . (M) Dependence Independence Correlation Markov Martingale . . . (H) Heterogeneity Identically Distributed Strict Stationarity Weak Stationarity Separable heterogeneity . . . assigned to the stochastic process { ∈N} underlying data x0 9
  • 10. These assumptions determine the joint distribution (x; θ) x∈R  of the sample X:=(1  ) including its parametrization θ∈Θ, as well as the likelihood function (θ; x0)∝(x0; θ) θ∈Θ; see Spanos (1986). Fisher proposed a complete reformulation of statistical induction by modeling the statistical Generating Mechanism (GM) [Mθ(x)] framed in terms of the observable stochastic process { ∈N} underlying data x0 Fisher (1922) asserts that Mθ(x) is chosen by responding to the question: “Of what population is this a random sample?” (p. 313), and adding that “and the adequacy of our choice may be tested posteriori.” (314). The ‘adequacy’ can be evaluated using Mis-Specification (M-S) testing; see Spanos (2006). That is, Mθ(x) is selected to account for the chance regularities in data x0 but its appropriateness is evaluated by M-S testing. The primary objective of frequentist inference is to use the sample infor- mation, as summarized by (x; θ) x∈R  in conjunction with data x0, to learn from data about θ∗ - true value of θ∈Θ; shorthand for saying that Mθ∗(x)={(x; θ∗ )} x∈R , could have generated data x0. Learning from x0 takes the form of ‘statistical approximations’ around θ∗ , framed in terms of the sampling distribution, (; θ) ∀∈R, of a statistic (estimator, 10
  • 11. test, predictor) =(1  ) derived using two different forms of reasoning via: ()=P( ≤ )= Z Z · · · Z | {z } {x: (x)≤} (x; θ)x ∀∈R (3.0.5) (i) Factual (estimation and prediction): presuming that θ=θ∗ ∈Θ, and (ii) Hypothetical (hypothesis testing): various hypothetical scenarios based on different prespecified values of θ, under 0: θ∈Θ0 (presuming that θ∈Θ0) and 1: θ∈Θ1 (presuming that θ∈Θ1) where Θ0 and Θ1 partition Θ. I Crucially important: (i) the statistical adequacy of Mθ(x) ensures that θ∗ lies within Mθ(x), and thus learning from data x0 is attainable. (ii) Neither form of frequentist reasoning (factual or hypothetical) involves conditioning on θ, an unknown constant. (iii) The decision-theoretic reasoning, for all values of θ in Θ (∀θ∈Θ), undermines learning from data about θ∗ , and gives rise to Stein-type paradoxes and admissibility fallacies; Spanos (2017). I Misspecification. When Mθ(x) is misspecified, (x; θ) is incorrect, and this distorts (; θ) and often induces inconsistency in estimators and size- 11
  • 12. able discrepancies between the actual and nominal error probabilities in Con- fidence Intervals (CIs), testing and prediction. This is why Akaike-type model selection procedures often go astray, since all goodness-of-fit/prediction measures presume the validity of Mθ(x); Spanos (2010). How can one apply Fisher’s model-based statistics when the empirical mod- eling begins with a substantive model Mϕ(x)? [i] Bring out the statistical model Mθ(x) implicit in Mϕ(x); there is always one that comprises solely the probabilistic assumptions imposed on data x0! It is defined as an unrestricted parametrization that follows from the probabilistic assumptions imposed on the process { ∈N} underlying x0 which includes Mϕ(x) as a special case. [ii] Relate the substantive parameters ϕ to θ via restrictions, say g(ϕ θ)=0 ensuring that the restrictions g(ϕ θ)=0 define ϕ uniquely in terms of θ Example. For the substantive model known as the Capital Asset Pricing: Mϕ(z): (−2)=1(1−2)+ (|X=x) vNIID(0 2 ) ∈N Mθ(z): =0+11+22+ (|X=x) vNIID(0 2 ) ∈N g(ϕ θ)=0: 0=0 1+2−1=0 where ϕ=(1 2 ) θ=(0 1 2 2 ) 12
  • 13. [iii] Test the validity of 0: g(ϕ θ)=0 vs. 1: g(ϕ θ)6=0 to establish whether the substantive model Mϕ(z) belies data z0. Main features of the Fisher model-based approach [a] Inductive premises: Mθ(x) comprises a set of complete, internally consistent, and testable probabilistic assumptions, relating to the observable process { ∈N} underlying data x0 from the Distribution, Dependence and Heterogeneity categories. Mθ(x) is viewed as a statistical stochastic mechanism aiming to account for all the chance regularity patterns in data x0. [b] Model choice: the appropriate Mθ(x) is chosen on statistical adequacy grounds using comprehensive Mis-Specification (M-S) testing to ensure that inferences are reliable: the actual ' nominal error probabilities. [c] Inductive inference: the interpretation of probability is frequentist and the underlying inductive reasoning is either factual (estimation, prediction) or hypothetical (testing) and relates to learning from data about θ∗ . The effectiveness (optimality) of inference procedures is calibrated using error probabilities based on a statistically adequate Mθ(x)! 13
  • 14. Regrettably, the replication crisis literature often confuses hypothetical reasoning with conditioning on 0! Diaconis and Skyrms (2018) claim (tongue-in-cheek) that p-value testers conflate (0|x0) with (x0|0): “The untutored think they are getting the probability of effectiveness [of a drug] given the data, while they are being given conditional probabilities going in the opposite direction.” (p. 67) I The ‘untutored’ know from basic probability theory that conditioning on 0: =0 is formally illicit since  is neither an event nor a random variable! [d] Substantive vs. statistical: the substantive model, Mϕ(x) is embedded into a statistically adequate Mθ(x) via restrictions g(θ ϕ)=0 θ∈Θ, ϕ∈Φ whose rejection indicates that the substantive information in Mϕ(x) belies x0! I The above modeling strategy [i]-[iv] can be used to provide sound statis- tical foundations for Graphical Causal Modeling that revolves around substantive causal models [Mϕ(x)]. It will enable a harmonious blending of the statistical with the substantive information without undermining the credibility of either and allow for probing the validity of causal information. 14
  • 15. 4 Graphical Causal (GC) Modeling Quantifying a Graphical Causal (GC) model based on directed acyclic graphs (DAG) (Pearl, 2009; Spirtes et. al 2000) constitutes another form of curve-fitting a priori postulated substantive model Mϕ(z). An crucial weakness of the GC modeling is that the causal information (substantive) is usually treated as established knowledge instead of best- daresay conjectures whose soundness needs to be tested against data Z0. I Foisting a DAG substantive model, Mϕ(z) on data Z0 will usually yield a statistically and substantively misspecified model! This is because the estimation of Mϕ(z) invokes a set of probabilistic as- sumptions relating to the observable process {Z ∈N} underlying data Z0, the implicit statistical model Mθ(z) whose adequacy is unknown! I Can one guard against statistical and substantive misspecification? Embed the DAG model into the Fisher model-based framework Step 1. Unveil the statistical model Mθ(z) implicit in the GC model. Step 2. Establish the statistical adequacy of Mθ(z) using comprehensive M-S testing, and respecification when needed. 15
  • 16. Substantive (GC) model [a]  - confounder =0+1+2+1 =0+1+2 ∈N Statistical model for [a] =01+11+1 =02+12+2 Substantive (GC) model [b]  - mediator =0+1+2+1 =0+1+3 ∈N Statistical model for [b] =01+11+3 =02+12+4 Substantive (GC) model [c]  - collider =0+1+4 ∈N =0+1+2+5 Statistical model for [c] =01+11+3 =02+12+4 Diagram 2: Functional Graphical Causal models Step 3. Use a statistically adequate Mθ(z) to address the identification and estimation of the structural parameters ϕ∈Φ Step 4. Test the validity of the overidentifying restrictions stemming from g(θ ϕ)=0, θ∈Θ, ϕ∈Φ. Excellent goodness-of-fit/prediction is relevant for substantive adequacy, which can only be probed after: (i) establishing the statistical adequacy of Mθ(z) and (ii) evaluating the validity of the restrictions: 0: g(θ ϕ)=0 vs. 1: g(θ ϕ)6=0 Rejecting 0 indicates that the substantive information in Mϕ(x) belies x0! 16
  • 17. 5 Nonparametric statistics & curve-fitting Nonparametric statistics began in the 1970s extending Kolmogorov (1933a): when the sample X:=(1 2  ) is IID, the empirical cdf b () is a good estimator of the cdf () for  large enough. Attempts to find good estimators for the density function () ∈R led to: (a) kernel smoothing and related techniques, including regression-type models, (b) series estimators of b ()= P =0 () where {() =1 2  } are polynomials, usually orthogonal; see Wasserman (2006). A nonparametric statistical model is specified in terms of a broader family F of distributions (Wasserman, 2006): MF(x)={(x; ψ) ∈F} ψ∈Ψ x∈R  where F is defined in terms of indirect & non-testable Distribution assump- tions such as: (a) the existence of moments up to order  ≥ 1 (see Bahadur and Savage, 1956, on such assumptions), (b) smoothness restrictions on the unknown density function () ∈R (symmetry, differentiability, unimodality of () etc.). 17
  • 18. Dickhaus (2018), p. 13: “Of course, the advantage of considering F is that the issue of model misspecification, which is often problematic in parametric models, is avoided.” Really? Nonparametric models always impose Dependence and Heterogeneity as- sumptions (often Independent and Identically Distributed (IID))! What are the consequences of replacing (x; θ) with (x; ψ) ∈F? The likelihood-based inference procedures are replaced by loss function- based procedures driven by mathematical approximations and goodness-of-fit measures, relying on asymptotic inference results at a high price in reliability and precision of inference since (i) the adequacy of (x; ψ) ∈F is impossible to establish, and (ii) the ‘indirect’ and non-testable distribution assumptions invariably contribute substantially to the imprecision/unreliability of inference. I As argued by Le Cam (1986), p. xiv: “... limit theorems “as  tends to infinity” are logically devoid of content about what happens at any particular . All they can do is suggest certain approaches whose performance must then be checked on the case at hand. Unfortunately, the approximation bounds we could get are too often too crude and cumbersome to be of any practical use.” 18
  • 19. 6 Data Science (Machine Learning and all that!) Big Data and Data Science includes Machine Learning (ML), Statistical Learning Theory (SLT), pattern recognition, data mining, etc. As claimed by Vapnik (2000): “Between 1960 and 1980 a revolution in sta- tistics occurred: Fisher’s paradigm, ... was replaced by a new one. This para- digm reflects a new answer to the fundamental question: What must one know a priori about an unknown functional dependency in order to estimate it on the basis of observations? In Fisher’s paradigm the answer was very restrictive — one must know almost everything. Namely, one must know the desired dependency up to the values of a finite number of parameters. ... In the new paradigm ... it is sufficient to know some general properties of the set of functions to which the unknown dependency belongs.” (ix). In Fisher’s model-based approach one selects Mθ(z) to account for the chance regularities in data Z0, and evaluates its validity before any inferences are drawn, respecifying it when Mθ(z) is misspecified. The form of dependence follows from the probabilistic assumptions of a statistically adequate Mθ(z). Example. Assuming that { ∈N} is Normal, Markov and stationary, the de- pendence is an AR(1) model =0+1−1+ (|(−1)) vNIID ¡ 0 2 ¢  ∈N 19
  • 20. Machine Learning views statistical modeling as an optimization prob- lem relating to how a machine can ‘learn from data’: (a) learner’s input: a domain set X, a label set Y (b) training data X × Y: z:=(x ) =1 2   (c) with an unknown distribution ∗ (z), and (d) learner’s output: (): X → Y The learning algorithm is all about choosing () to approximate ‘closely’ the true relationship =(x) ∈N by minimizing the distance k(x) − (x)k. Barriers to entry? The underlying probabilistic and mathematical framework comes in the form of functional analysis: the study of infinite- dimensional vector (linear) spaces endowed with a topology (metric, norm, inner product) and a probability measure. Example. The normed linear space ([ ] k  k) of all real-valued continuous functions () defined on [ ]⊂R with the -norm (Murphy, 2022): kk = ³R   |()|  ´1   or kk = ( P =0 |()| ) 1   =1 2 ∞ (6.0.6) The mathematical approximation problem is transformed into an optimization in the context of a vector space employing powerful theorems such as the open 20
  • 21. mapping, the Banach-Steinhaus, the Hahn-Banach theorems; see Carlier (2022). To ensure the existence and uniqueness of the optimization solution, the ap- proximation problem is often embedded in a complete inner product vector (linear) space ([ ] k  k2) of real or complex-valued functions (X) defined on [ ]⊂R with the 2-norm, also known as a Hilbert space of square- integrable functions ((|X|2 )), where {X ∈N} is a stochastic process. A Hilbert space generalizes the -dimensional Euclidean geometry to an infinite di- mensional inner product space that allows lengths and angles to be defined to render optimization possible. Supervised learning (Regression). A typical example is a regression model: =(x; ψ) +  where (x; ψ)∈G where G is a family of smooth enough mathematical func- tions, is approximated using data Z0:={(x) =1 2  }. Risk functions. The problem is framed in terms of a loss function: ( (x; ψ)) ∀ψ∈Ψ ∀z∈R   (a) kk2 : ( (x; ψ))=(−(x; ψ))2  (b)kk1 : ( (x; ψ))=| − (x; ψ)| 21
  • 22. To render ( (x; ψ)) only a function of ∀ψ∈Ψ ∀z∈R  is eliminated by taking expectations wrt the distribution of the sample Z ∗ (z)-presumed un- known, to define a risk function: (∗  (z; ψ))=z(( (x; ψ)))= R z ( (x; ψ))∗ (z)z ∀ψ∈Ψ The statistical model implicit in Data Science is: MF(z)={∗ (z) ∈F(ψ)} ψ∈Ψ z∈R   and the ensuing inference revolves around the risk function using the decision theoretic reasoning based on ∀ψ∈Ψ Hence, Data Science tosses away all forms of frequentist inference apart from the point estimation. Since ∗ is unobservable (∗  (z; ψ)) is estimated using basic descriptive statistics: b ()=1  P =1 ! Assuming that {Z ∈N} is IID (often not stated explicitly!) one can use the arithmetic average b (∗  (z; ψ))=1  P =1 ( (x; ψ(x)) and then minimize it to yield a con- sistent estimator of (x; ψ): b (z;b ψ(x))= arg min ∈G [1  P =1 ( (x; ψ(x))] (6.0.7) where b =b (z;b ψ(x)) minimizing (6.0.7) is inherently overparametrized! 22
• 23. Regularization. Depending on the dimension (effective number of parameters) of the class of functions G, (6.0.7) will usually give rise to serious overfitting (near-interpolation!). To reduce the inherent overparametrization problem, ML methods impose ad hoc restrictions on the parameters, euphemistically known as regularization, deriving ĝ = g(z; φ̂(x)) by minimizing:
R_λ(f*, g(z; ψ)) = R(f*, g(z; ψ)) + λ·C(g(z; ψ))   (6.0.8)
where C(g(z; ψ)) is often related to the algorithmic complexity of the class G.
Example. For the LR model y_t = β⊤x_t + ε_t, β is estimated by minimizing:
∑_{t=1}^n (y_t − β⊤x_t)²  [least squares]  +  λ₁ ∑_i |β_i| + λ₂ ∑_i β_i²  [regularization term].
The idea is that regularization reduces the variance of β̂ with only small increases in its bias, which improves the prediction MSE ∑_{t=n+1}^{n+m} (y_t − β̂⊤x_t)² (artificially!) at the expense of learning from data; Spanos (2017).
Probably Approximately Correct (PAC) learnability refers to 'learning' (computationally) about f*(z) using a polynomial-time algorithm to choose
  • 24. (z; ψ) in G, in the form of an upper bound (Murphy, 2022): P(max ∈G | b (∗  (z; ψ))−(∗  (z; ψ))|  )≤2 dim(G) exp(−22 ) Statistical inference in Data Science is invariably based on asymptotic results (‘as  → ∞’) invoking IID, such as the Uniform Law of Large Number (ULLN) for the whole family of functions G as well as more general asymptotic results de- rived by invoking non-testable mathematical and probabilistic assumptions (‘as  → ∞’), such as mixing conditions (Dependence) and asymptotic homogeneity (Heterogeneity)! Main features of the Data Science approach: [a] Inductive premises: normed linear spaces  ( = P()) endowed with a probability measure relating to the stochastic process {Z ∈N} The proba- bilistic assumptions underlying {Z ∈N} are often IID, with an indirect distri- bution assumption relating to a family of mathematical functions G and chosen on mathematical approximation grounds. [b] Model choice: the best fitted curve b y=G(x;b φ(x)) in G is chosen on goodness-of-fit/prediction grounds or/and Akaike-type information criteria. 24
• 25. [c] Inductive inference: the interpretation of probability can be both frequentist and Bayesian, but the underlying reasoning is decision-theoretic (∀ψ∈Ψ), which is at odds with frequentist inference. The optimality of inference procedures is based on the risk function and framed in terms of asymptotic theorems that invoke non-testable mathematical and probabilistic assumptions. The 'best' fitted curve ŷ = G(x; φ̂(x)) in G is used for 'predictive learning'.
[d] Substantive vs. statistical: the fitted curve ŷ = G(x; φ̂) in G is rendered a black box, free of any statistical/substantive interpretation, since the curve-fitting and regularization impose arbitrary restrictions on ψ to fine-tune the prediction error. This obviates any possibility of interpreting φ̂(x) or establishing evidence for potential causal claims, etc.
Weaknesses of Data Science (ML, SLT, etc.) algorithmic methods
1. The Curve-Fitting Curse: viewing the modeling facet with data as an optimization problem in the context of a Hilbert space of overparametrized functions will always guarantee a unique solution on goodness-of-fit/prediction grounds, trustworthiness be damned.
▶ Minimizing ∑_{t=n+1}^{m} (y_t − ĝ(x_t; ψ))², using additional data in the testing and
• 26. validation facets, z_t := (y_t, x_t), t = n+1, n+2, …, m, has no added value in learning from data when the training-based choice ĝ = g(x; ψ̂) is statistically misspecified. It just adds more scope for tweaking!
2. Is L(θ) = −ln L(θ; Z0), θ∈Θ, just another loss function (Cherkassky and Mulier, 2007, p. 31)? No! ln L(θ; Z0) is based on the testable probabilistic assumptions comprising M_θ(z), as opposed to arbitrary loss functions based on information other than the data Z0 (Spanos, 2017).
3. Mathematical approximation error terms are very different from white-noise statistical error terms. The former rely on Jackson-type upper bounds that are never statistically non-systematic. Hence, conflating the two errors imperils the trustworthiness of evidence; Spanos (2010).
4. What patterns? Curve-fitting using mathematical approximation patterns is very different from accounting for the recurring chance regularity patterns in data Z0 that relate directly to probabilistic assumptions. Indeed, the former seek approximation patterns which often invoke the validity of certain probabilistic assumptions. For instance, supervised and unsupervised learning using scatterplots invokes IID for Z0; Wilmott (2019), p. 66.
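The ℓ1/ℓ2-penalized least-squares objective in the LR example of slide 23 can be sketched in the single-regressor case with an ℓ2 (ridge) penalty only, where the minimizer has the closed form β̂(λ) = ∑x_t y_t / (∑x_t² + λ); the data and penalty value are illustrative:

```python
# Sketch of the regularization idea from slide 23's LR example,
# specialized to one regressor and an l2 (ridge) penalty only:
# minimizing sum (y_t - b*x_t)^2 + lam*b^2 over b has the closed form
# b_hat(lam) = sum(x*y) / (sum(x^2) + lam). Data and lam illustrative.

def ridge_beta(data, lam):
    """Closed-form single-regressor ridge estimator."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # roughly y = 2x
b_ols = ridge_beta(data, lam=0.0)    # ordinary least squares
b_ridge = ridge_beta(data, lam=5.0)  # shrunk toward zero
```

The shrinkage lowers the variance of the estimator at the cost of bias toward zero, which is exactly the trade-off the slides flag as improving the prediction MSE 'artificially' at the expense of learning about β.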
• 27. Example. For data Z0 := {(x_t, y_t), t = 1, 2, …, n}, the scatterplot presupposes IID! Unfortunately, the IID assumptions are false for both data series shown below, which exhibit trending means (non-ID) and irregular cycles (non-I).
[Figure: t-plots of the two data series, exhibiting trending means and irregular cycles]
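A crude version of the check this example calls for can be sketched in pure Python: before trusting a scatterplot, compare the half-sample means of each series, since a trending mean (non-ID) makes them differ markedly. The data generation and the comparison are illustrative, not a formal Mis-Specification test:

```python
# Crude sketch of checking the ID ('identically distributed') part of
# IID before trusting a scatterplot: a series with a trending mean has
# clearly different half-sample means, while an IID series does not.
# Data generation and comparison are illustrative only.
import random

random.seed(1)

def half_means(series):
    """Mean of the first and second half of a series."""
    n = len(series) // 2
    return sum(series[:n]) / n, sum(series[n:]) / (len(series) - n)

# trending mean (non-ID): the mean drifts upward with the index t
trending = [0.05 * t + random.gauss(0.0, 1.0) for t in range(200)]
# IID benchmark series
iid = [random.gauss(0.0, 1.0) for _ in range(200)]

m1, m2 = half_means(trending)      # second half sits markedly higher
m1_iid, m2_iid = half_means(iid)   # halves agree up to sampling noise
```

In practice one would use proper t-plots and formal M-S tests, but even this crude split exposes the non-ID structure that the scatterplot silently assumes away.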
• 28. 5. How reliable are Data Science inferences? The training/testing/validation split of the data can improve the selected models on prediction grounds, but it will not secure the reliability of inference.
6. In contrast to PAC learnability, which takes the fitted ŷ = G(x; φ̂(x))∈G at face value to learn about f*(z), the learning in Fisher's model-based statistics stems from f(z; θ), z∈ℝ_Z, to g(x; ψ(θ)) = E(y_t | X_t = x_t), where the probabilistic structure of f(z; θ) determines g = g(x; ψ(θ)) as well as ψ(θ).
7. The impression in Data Science that the combination of (i) a very large sample size n for data Z0, (ii) the training/testing/validation split, and (iii) asymptotic inference renders the statistical adequacy problem irrelevant is an illusion! Departures from IID will render both the reliability and the precision of inference worse and worse as n increases (Spanos & McGuirk, 2001). Moreover, invoking limit theorems 'as n → ∞' based on non-testable Dependence and Heterogeneity assumptions is another head game.
On a positive note, ML can be useful when: (i) the data Z0 are (luckily) IID, (ii) Z_t includes a large number of variables, (iii) one has meager substantive information, and (iv) the sole objective is a short-horizon 'makeshift' prediction.
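The claim that departures from IID are not cured by a larger n can be illustrated with a small simulation, assuming the analyst wrongly treats a positively autocorrelated AR(1) series as IID; the parameter values are illustrative:

```python
# Sketch: under AR(1) dependence y_t = rho*y_{t-1} + e_t, Var(sample
# mean) is inflated relative to the IID formula Var(y_t)/n by a factor
# approaching (1+rho)/(1-rho) that does NOT shrink as n grows, so
# IID-based intervals stay too narrow at every sample size.
import random

def var_inflation(n, rho, reps=2000, seed=0):
    """Monte Carlo Var(sample mean) divided by the IID-based value."""
    rng = random.Random(seed)
    var_y = 1.0 / (1.0 - rho ** 2)          # stationary Var(y_t)
    acc = 0.0
    for _ in range(reps):
        y = rng.gauss(0.0, var_y ** 0.5)    # stationary start
        s = 0.0
        for _ in range(n):
            y = rho * y + rng.gauss(0.0, 1.0)
            s += y
        m = s / n
        acc += m * m
    mc_var = acc / reps
    return mc_var / (var_y / n)             # = 1 under IID

ratio_small = var_inflation(n=50, rho=0.6)
ratio_large = var_inflation(n=400, rho=0.6)
# both ratios sit near (1 + 0.6)/(1 - 0.6) = 4: more data does not help
```

The inflation factor is roughly constant in n, so every IID-based standard error, at any sample size, understates the true uncertainty by about the same proportion.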
• 29. 7 Summary and conclusions
7.1 'Learning from data' about phenomena of interest
Breiman's (2001) claim that in Fisher's paradigm "One assumes that the data are generated by a given stochastic data model" refers to a common but erroneous implementation of model-based statistics in which M_θ(z) is viewed as an a priori postulated model, presumed to be valid no matter what; see Spanos (1986). In fact, Fisher (1922), p. 314, emphasized the crucial importance of model validation:
"For empirical as the specification of the hypothetical population [M_θ(z)] may be, this empiricism is cleared of its dangers if we can apply a rigorous and objective test of the adequacy with which the proposed population [M_θ(z)] represents the whole of the available facts."
i.e. M_θ(z) accounts for all the chance regularities in Z0.
Fisher's parametric model-based [M_θ(z)] statistics, relying on strong (not weak) probabilistic assumptions that are validated vis-a-vis data Z0, provide the best way to learn from data using 'statistical approximations' around θ*, framed in terms of sampling distributions of 'statistics', because they secure the effectiveness (reliability and precision) of inference and the trustworthiness of the ensuing evidence.
• 30. The Data Science algorithmic and the Graphical Causal (GC) modeling approaches share an inbuilt proclivity to side-step the statistical misspecification problem. The obvious way to improve the trustworthiness of their evidence is to integrate them within a broader Fisher model-based statistical framework. In turn, sophisticated algorithms can enhance the model-based approach in several ways, including more thorough M-S testing. That, of course, would take a generation to implement, mainly due to the pronounced differences in culture and terminology!
In the meantime, the trustworthiness of evidence in Data Science can be ameliorated using simple M-S testing to evaluate the non-systematicity of the residuals from the fitted curve: ŷ_t = g(x_t; ψ̂), û_t = y_t − ŷ_t, t = 1, 2, …, n.
It is important to emphasize that 'excellent' statistical prediction is NOT just small prediction errors relative to a loss function, but non-systematic and 'small' prediction errors relative to likelihood-based goodness-of-prediction measures; see Spanos (2007).
"All models are wrong, but some are useful!" NO statistically misspecified model is useful for 'learning from data' about phenomena of interest!
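The simple residual-based M-S check suggested here can be sketched as a lag-1 autocorrelation test on the residuals û_t from a fitted curve; the data generation, the deliberately misspecified straight-line fit, and the rough ±2/√n band are illustrative:

```python
# Minimal sketch of the residual-based M-S check: the lag-1
# autocorrelation of the residuals u_t = y_t - y_hat_t, compared with
# the rough +/- 2/sqrt(n) band. A straight line is deliberately
# (mis)fitted to data with a quadratic mean, so the residuals come out
# systematic. Data and threshold are illustrative only.
import math
import random

def lag1_autocorr(u):
    """Sample lag-1 autocorrelation of a series."""
    ubar = sum(u) / len(u)
    num = sum((u[t] - ubar) * (u[t - 1] - ubar) for t in range(1, len(u)))
    den = sum((v - ubar) ** 2 for v in u)
    return num / den

rng = random.Random(42)
n = 300
x = [t / n for t in range(n)]
y = [1.0 + 2.0 * xi + 4.0 * xi ** 2 + rng.gauss(0.0, 0.2) for xi in x]

# OLS fit of the (misspecified) straight line y = a + b*x
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

r1 = lag1_autocorr(resid)
band = 2.0 / math.sqrt(n)
misspecified = abs(r1) > band   # systematic residuals flag misspecification
```

A fitted curve can have a small mean squared error and still fail this check badly, which is precisely the distinction the slide draws between small and non-systematic prediction errors.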
• 31. 7.2 Potential casualties of the STATISTICS WARS
(1) Frequentist inference in general, and hypothesis testing in particular, as well as the two modes of reasoning underlying frequentist inference: factual and hypothetical.
(2) Error probabilities and their key role in securing the trustworthiness of evidence by controlling and evaluating how severely tested claims are, including:
(a) Statistical adequacy: does M_θ(z) account for the chance regularities in data Z0?
(b) Substantive adequacy: does the model M_φ(z) shed adequate light on (describe, explain, predict) the phenomenon of interest?
(3) Mis-Specification (M-S) testing and respecification to account for the chance regularity patterns exhibited by data Z0, and to ensure that the substantive information does not belie the data.
(4) Learning from data about phenomena of interest. Minimizing a risk function to reduce the overall Mean Square Prediction Error (MSPE) '∀ψ∈Ψ' undermines learning from Z0 about ψ*; Spanos (2017).
Thanks for listening!