arXiv:2210.11104v2 [stat.ME] 3 May 2023
On the pitfalls of Gaussian likelihood scoring for causal discovery
Christoph Schultheiss and Peter Bühlmann
Seminar for Statistics, ETH Zürich
May 4, 2023
Abstract
We consider likelihood score-based methods for causal discovery in structural causal models.
In particular, we focus on Gaussian scoring and analyze the effect of model misspecification in
terms of non-Gaussian error distribution. We present a surprising negative result for Gaussian
likelihood scoring in combination with nonparametric regression methods.
Keywords: graphical models, model misspecification, nonparametric regression, structural causal
models
1 Introduction
We consider the problem of finding the causal structure of a set of random variables $X_1, \dots, X_p$.
We assume that the data can be represented by a structural equation model whose structure is
given by a directed acyclic graph (DAG), say, $G^0$, with nodes $1, \dots, p$, and denote by $\mathrm{PA}(j)$ the
parents of $j$. Without further assumptions on the structural causal model, one can only find $G^0$ up
to its Markov equivalence class. This can, e.g., be achieved with the PC algorithm (Spirtes et al.,
2000). Its generality comes at the price of requiring conditional independence tests, which are a
statistically hard problem.
Here, we focus on the so-called additive noise model (ANM)
$$X_j = f_j\big(X_{\mathrm{PA}(j)}\big) + E_j, \qquad j = 1, \dots, p, \tag{1}$$
where the $E_j$ are mutually independent centered random variables. This is a popular modelling
assumption that allows for better identifiability guarantees; see, e.g., Hoyer et al. (2008) and Peters
et al. (2014).
For an arbitrary DAG $G$, let $\mathrm{PA}_G(j)$ be the nodes from which a directed edge towards $j$ starts.
Define
$$E^G_j = X_j - \mathbb{E}\big[X_j \mid X_{\mathrm{PA}_G(j)}\big], \qquad j = 1, \dots, p.$$
Obviously, $E^{G^0}_j = E_j$. Under not overly restrictive assumptions, it holds that $E^G_1, \dots, E^G_p$ are mutually
independent only if $G \supseteq G^0$; see Peters et al. (2014). Thus, an obvious approach to find $G^0$ is to
loop over all possible graphs and test for independence of the residuals. Of course, this becomes
infeasible when the dimensionality $p$ grows.
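To get a concrete sense of this infeasibility, one can count the labeled DAGs on $p$ nodes. The following sketch (ours, not from the paper) evaluates Robinson's recurrence for this count (OEIS A003024), which grows super-exponentially in $p$:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes via Robinson's recurrence (OEIS A003024)."""
    if n == 0:
        return 1
    # inclusion-exclusion over the k nodes with no incoming edge
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print([num_dags(p) for p in range(1, 7)])
# [1, 3, 25, 543, 29281, 3781503] -- exhaustive search over graphs is hopeless already for moderate p
```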
A more reasonable algorithm based on greedy search is presented in Peters et al. (2014). They
also introduce RESIT (regression with subsequent independence test), which iteratively detects sink
nodes. Finding the true DAG is guaranteed assuming perfect regression estimates and perfect independence tests.
The procedure involves $O(p^2)$ nonparametric independence tests, which can be computationally demanding or
lack power when the sample size is small.
Instead of performing independence tests, one can compare the likelihood scores of different
graphs:
$$\mathcal{L}(G) = \mathbb{E}\left[\log\left(\prod_{j=1}^{p} p^G_j\big(E^G_j\big)\right)\right] = \sum_{j=1}^{p} \mathbb{E}\left[\log p^G_j\big(E^G_j\big)\right],$$
where $p^G_j$ denotes the density of $E^G_j$. Only for independent $E^G_j$ does their multivariate density factorize
as suggested. Therefore, the true DAG maximizes this quantity due to the properties of the
Kullback--Leibler divergence. In practice, this comes with the additional difficulty of estimating the
densities $p^G_j(\cdot)$, and one needs to add some penalization to prefer simpler graphs and avoid selecting
complete graphs; see, e.g., Nowzohour and Bühlmann (2016).
If one additionally assumes that the $E_j$ are marginally normally distributed as $\mathcal{N}\big(0, \sigma_j^2\big)$, one can
instead consider the Gaussian likelihood
$$\mathcal{L}_N(G) := \sum_{j=1}^{p} \left(-\log \sigma^G_j - \frac{1}{2} - \frac{1}{2}\log(2\pi)\right) = -\sum_{j=1}^{p} \log \sigma^G_j + C,$$
where $\big(\sigma^G_j\big)^2 = \mathbb{E}\big[\big(E^G_j\big)^2\big]$ (Bühlmann et al., 2014). It holds that $\mathcal{L}(G) \geq \mathcal{L}_N(G)$ and $\mathcal{L}_N\big(G^0\big) = \mathcal{L}\big(G^0\big)$.
Thus, the problem simplifies to finding the graph that leads to the lowest sum of log-variances.
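A quick numerical check (a sketch of ours, not from the paper) of the identity behind $\mathcal{L}_N$: when the scale is fitted by maximum likelihood, i.e., $\hat\sigma^2$ equals the mean squared residual, the average Gaussian log-density of the residuals equals $-\log\hat\sigma - 1/2 - \log(2\pi)/2$ exactly, for any residual sample.

```python
import math
import random

random.seed(1)
# synthetic residuals standing in for the E_j^G of some fitted graph
r = [random.gauss(0.0, 2.0) for _ in range(10_000)]
# MLE scale; matches the paper's (sigma_j^G)^2 = E[(E_j^G)^2] (uncentered second moment)
sigma2 = sum(v * v for v in r) / len(r)
sigma = math.sqrt(sigma2)

# average Gaussian log-density of the residuals at the fitted scale
avg_loglik = sum(
    -0.5 * math.log(2 * math.pi) - math.log(sigma) - v * v / (2 * sigma2) for v in r
) / len(r)

# closed form of the per-node contribution to L_N
closed = -math.log(sigma) - 0.5 - 0.5 * math.log(2 * math.pi)
assert abs(avg_loglik - closed) < 1e-9
```

The identity is exact (not just in expectation) because the quadratic term averages to $1/2$ by the definition of $\hat\sigma^2$; this is why the Gaussian score reduces to a sum of log residual standard deviations.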
Under such a normality assumption for the $E_j$ and with the $f_j(\cdot)$ in (1) being additive in their arguments,
Bühlmann et al. (2014) present a causal discovery method that is consistent for high-dimensional
data. To justify this approach for a broader class of error distributions, define
$$\Delta := \min_{G \not\supseteq G^0} \sum_{j=1}^{p} \left(\log \sigma^G_j - \log \sigma^{G^0}_j\right). \tag{2}$$
Then, one has to assume that

(A1) $\Delta > 0$.
That is, the lowest possible expected negative Gaussian log-likelihood over all graphs $G$ that
are not supersets of the true $G^0$ must be strictly larger than the expected negative Gaussian log-
likelihood under the true graph $G^0$. An argument for this assumption is that a true causal model
should be easier to fit in some sense and should thus also attain lower error variances. Using simple,
"non-pathological" examples, we demonstrate that this can easily be a fallacy when the true error
distribution is misspecified. Thus, we advocate being very careful when using Gaussian
scoring with flexible nonparametric regression functions in causal discovery. The main part of this
paper considers theoretical population properties. We discuss some data applications in Section
5.1.
2 Data-generating linear model
We consider first data-generating linear models where
$$f_j\big(X_{\mathrm{PA}(j)}\big) = \sum_{k \in \mathrm{PA}(j)} \beta_{jk} X_k.$$
For these, we establish the explicit result in Theorem 1. The intuition for this result carries over to a range
of nonlinear ANMs (1), especially if the causal effects are close to linear. We present corresponding
examples in Section 3.
If all the $E_j$ in a data-generating linear model are Gaussian, i.e., $X_1, \dots, X_p$ are jointly Gaussian,
it is known that any causal order could induce the given multivariate normal distribution and
obtain an optimal Gaussian score. Assuming that the distribution is faithful with respect to the
true DAG $G^0$, the set of sparsest DAGs obtaining the optimal Gaussian score corresponds to
the Markov equivalence class of $G^0$; see, e.g., Zhang and Spirtes (2008). Thus, one can obtain this
Markov equivalence class by preferring sparser DAGs in case of equal scores. In general, the
single true DAG cannot be determined even if the full multivariate distribution is known. In
contrast, the Linear Non-Gaussian Acyclic Model (LiNGAM) introduced in Shimizu et al. (2006) is
known to be identifiable. In such linear non-Gaussian models, algorithms designed for linear Gaussian
models, e.g., the PC algorithm using partial correlation to assess conditional independence,
do not use all the available information, but typically still provide the same guarantees since they
depend on the covariance structure only. The covariance matrix of data generated by a linear causal
model does not change when replacing a Gaussian $E_j$ by an additive error of the same variance but
otherwise arbitrary distribution. Thus, assuming the faithfulness condition for the data-generating
distribution, Gaussian scoring for linear causal models in the infinite-sample limit leads to the true
underlying Markov equivalence class even under misspecification of the error distribution.
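A minimal bivariate simulation (ours, not from the paper) illustrating this covariance-only behaviour: when both directions are fitted by ordinary least squares, the forward and backward Gaussian scores tie exactly, whatever the error distribution, because both equal $\tfrac{1}{2}\log\big(\widehat{\mathrm{Var}}(X_1)\,\widehat{\mathrm{Var}}(X_2)\,(1-\hat\rho^2)\big)$.

```python
import math
import random

random.seed(2)
n = 10_000
# true model X1 -> X2 with deliberately non-Gaussian (centered exponential) errors
x1 = [random.uniform(-1.0, 1.0) for _ in range(n)]
x2 = [0.8 * a + random.expovariate(1.0) - 1.0 for a in x1]

def log_sd(z):
    m = sum(z) / len(z)
    return 0.5 * math.log(sum((v - m) ** 2 for v in z) / len(z))

def ols_residuals(y, x):
    """Residuals of simple linear least squares of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [c - my - b * (a - mx) for a, c in zip(x, y)]

# Gaussian score (up to constants): sum of log residual standard deviations
forward = log_sd(x1) + log_sd(ols_residuals(x2, x1))   # causal order
backward = log_sd(x2) + log_sd(ols_residuals(x1, x2))  # anticausal order
assert abs(forward - backward) < 1e-9  # linear scores depend on the covariance only
```

The exact tie makes the point of this paragraph concrete: with linear regression, the Gaussian score cannot distinguish directions beyond the covariance structure, so non-Gaussian errors do no harm; the pitfall only appears once flexible regression enters.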
If the data-generating model is not known to be linear, so that nonparametric regression
methods, or the conditional mean as their population version, are applied, this generalization does
not hold anymore, as laid out in the following theorem. Let $\pi$ be a permutation on $\{1, \dots, p\}$
and $G_\pi$ the full DAG according to $\pi$, i.e., $\pi(k) \in \mathrm{PA}_{G_\pi}(\pi(j))$ if and only if $k < j$.
Theorem 1. Let $X_1, \dots, X_p$ come from a linear model
$$X_j = \sum_{k \in \mathrm{PA}(j)} \beta_{jk} X_k + E_j, \quad \text{with } \mathbb{E}[E_j] = 0 \text{ and } \mathbb{E}\big[E_j^2\big] < \infty \ \forall j \in \{1, \dots, p\}, \tag{3}$$
with mutually independent $E_j$. Then, for every possible permutation $\pi$,
$$\sum_{j=1}^{p} \left(\log \sigma^{G_\pi}_j - \log \sigma^{G^0}_j\right) \leq 0.$$
That is, for every causal order, the corresponding full graph scores at least as well as the true causal
graph.
Furthermore,
$$\sum_{j=1}^{p} \left(\log \sigma^{G_\pi}_j - \log \sigma^{G^0}_j\right) < 0 \quad \text{if } \exists j \in \{1, \dots, p\}: \ \mathbb{E}\big[X_j \mid X_{\mathrm{PA}_{G_\pi}(j)}\big] \neq X^{\top}_{\mathrm{PA}_{G_\pi}(j)} \beta_{\pi,j},$$
where $\beta_{\pi,j} = \mathbb{E}\big[X_{\mathrm{PA}_{G_\pi}(j)} X^{\top}_{\mathrm{PA}_{G_\pi}(j)}\big]^{-1} \mathbb{E}\big[X_j X_{\mathrm{PA}_{G_\pi}(j)}\big]$ is the least squares parameter.
That is, for every causal order, the corresponding full graph scores strictly better than the true
causal graph if at least one conditional expectation is nonlinear in the parental variables.
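The following toy simulation (our sketch, not the paper's experiments) illustrates the strict inequality for $p = 2$: a linear model with uniform errors of different ranges, so that $\mathbb{E}[X_1 \mid X_2]$ is piecewise linear, scored with a crude $k$-nearest-neighbour average as the nonparametric conditional-mean estimate in both directions.

```python
import math
import random

random.seed(3)
n = 20_000
# true model: X2 = X1 + E2 with uniform errors of different ranges,
# so that E[X1 | X2] is piecewise linear, i.e., nonlinear overall
x1 = [random.uniform(-1.0, 1.0) for _ in range(n)]
x2 = [a + random.uniform(-0.5, 0.5) for a in x1]

def log_sd(z):
    m = sum(z) / len(z)
    return 0.5 * math.log(sum((v - m) ** 2 for v in z) / len(z))

def knn_residuals(y, x, k=100):
    """Residuals of a crude nonparametric regression of y on x:
    fitted value = average of y over the ~2k+1 nearest x-neighbours."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ys = [y[i] for i in order]
    prefix = [0.0]
    for v in ys:
        prefix.append(prefix[-1] + v)
    res = [0.0] * len(x)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - k), min(len(x), rank + k + 1)
        res[i] = y[i] - (prefix[hi] - prefix[lo]) / (hi - lo)
    return res

forward = log_sd(x1) + log_sd(knn_residuals(x2, x1))   # true causal order
backward = log_sd(x2) + log_sd(knn_residuals(x1, x2))  # anticausal order
assert backward < forward  # the anticausal full graph scores strictly better
```

For this configuration, a short calculation gives a population backward residual variance of $1/16$ for the conditional mean versus $1/15$ for the best linear predictor; this gap of roughly $0.03$ on the log-standard-deviation scale is exactly what the Gaussian score rewards in the wrong direction.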