arXiv:2210.11104v2 [stat.ME] 3 May 2023
On the pitfalls of Gaussian likelihood scoring for causal discovery
Christoph Schultheiss and Peter Bühlmann
Seminar for Statistics, ETH Zürich
May 4, 2023
Abstract
We consider likelihood score-based methods for causal discovery in structural causal models.
In particular, we focus on Gaussian scoring and analyze the effect of model misspecification in
terms of non-Gaussian error distribution. We present a surprising negative result for Gaussian
likelihood scoring in combination with nonparametric regression methods.
Keywords: graphical models, model misspecification, nonparametric regression, structural causal
models
1 Introduction
We consider the problem of finding the causal structure of a set of random variables $X_1, \dots, X_p$.
We assume that the data can be represented by a structural equation model whose structure is
given by a directed acyclic graph (DAG), say, $G^0$, with nodes $1, \dots, p$, and denote by $\mathrm{PA}(j)$ the
parents of $j$. Without further assumptions on the structural causal model, one can only find $G^0$ up
to its Markov equivalence class. This can, e.g., be achieved with the PC algorithm (Spirtes et al.,
2000). Its generality comes at the price of requiring conditional independence tests, which are a
statistically hard problem.
Here, we focus on the so-called additive noise model (ANM)
$$X_j = f_j\big(X_{\mathrm{PA}(j)}\big) + E_j, \qquad j = 1, \dots, p, \tag{1}$$
where the $E_j$ are mutually independent centered random variables. This is a popular modelling
assumption that allows for better identifiability guarantees; see, e.g., Hoyer et al. (2008) and Peters
et al. (2014).
For an arbitrary DAG $G$, let $\mathrm{PA}_G(j)$ be the nodes from which a directed edge towards $j$ starts.
Define
$$E^G_j = X_j - \mathbb{E}\big[X_j \mid X_{\mathrm{PA}_G(j)}\big], \qquad j = 1, \dots, p.$$
Obviously, $E^{G^0}_j = E_j$. Under not overly restrictive assumptions, it holds that $E^G_1, \dots, E^G_p$ are mutually
independent only if $G \supseteq G^0$; see Peters et al. (2014). Thus, an obvious approach to find $G^0$ is to
loop over all possible graphs and test for independence of the residuals. Of course, this becomes
infeasible when the dimensionality $p$ grows.
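To get a concrete sense of this infeasibility, one can count the labeled DAGs on $p$ nodes. The following sketch (ours, not from the paper) evaluates Robinson's recurrence for this count (OEIS A003024), which grows super-exponentially in $p$:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes via Robinson's recurrence (OEIS A003024)."""
    if n == 0:
        return 1
    # inclusion-exclusion over the k nodes with no incoming edge
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print([num_dags(p) for p in range(1, 7)])
# [1, 3, 25, 543, 29281, 3781503] -- exhaustive search over graphs is hopeless already for moderate p
```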
A more reasonable algorithm based on greedy search is presented in Peters et al. (2014). They
also introduce RESIT (regression with subsequent independence test), which iteratively detects sink
nodes. Finding the true DAG is guaranteed assuming perfect regression estimates and perfect independence tests.
The procedure involves $O(p^2)$ nonparametric independence tests, which can be computationally demanding or
lack power when the sample size is small.
Instead of performing independence tests, one can compare the likelihood scores of different
graphs:
$$\mathcal{L}(G) = \mathbb{E}\left[\log\left(\prod_{j=1}^{p} p^G_j\big(E^G_j\big)\right)\right] = \sum_{j=1}^{p} \mathbb{E}\left[\log p^G_j\big(E^G_j\big)\right],$$
where $p^G_j$ denotes the density of $E^G_j$. Only for independent $E^G_j$ does their multivariate density factorize
as suggested. Therefore, the true DAG maximizes this quantity due to the properties of the
Kullback--Leibler divergence. In practice, this comes with the additional difficulty of estimating the
densities $p^G_j(\cdot)$, and one needs to add some penalization to prefer simpler graphs and avoid selecting
complete graphs; see, e.g., Nowzohour and Bühlmann (2016).
If one additionally assumes that the $E_j$ are marginally normally distributed as $\mathcal{N}\big(0, \sigma_j^2\big)$, one can
instead consider the Gaussian likelihood
$$\mathcal{L}_N(G) := \sum_{j=1}^{p} \left(-\log \sigma^G_j - \frac{1}{2} - \frac{1}{2}\log(2\pi)\right) = -\sum_{j=1}^{p} \log \sigma^G_j + C,$$
where $\big(\sigma^G_j\big)^2 = \mathbb{E}\big[\big(E^G_j\big)^2\big]$ (Bühlmann et al., 2014). It holds that $\mathcal{L}(G) \geq \mathcal{L}_N(G)$ and $\mathcal{L}_N\big(G^0\big) = \mathcal{L}\big(G^0\big)$.
Thus, the problem simplifies to finding the graph that leads to the lowest sum of log-variances.
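A quick numerical check (a sketch of ours, not from the paper) of the identity behind $\mathcal{L}_N$: when the scale is fitted by maximum likelihood, i.e., $\hat\sigma^2$ equals the mean squared residual, the average Gaussian log-density of the residuals equals $-\log\hat\sigma - 1/2 - \log(2\pi)/2$ exactly, for any residual sample.

```python
import math
import random

random.seed(1)
# synthetic residuals standing in for the E_j^G of some fitted graph
r = [random.gauss(0.0, 2.0) for _ in range(10_000)]
# MLE scale; matches the paper's (sigma_j^G)^2 = E[(E_j^G)^2] (uncentered second moment)
sigma2 = sum(v * v for v in r) / len(r)
sigma = math.sqrt(sigma2)

# average Gaussian log-density of the residuals at the fitted scale
avg_loglik = sum(
    -0.5 * math.log(2 * math.pi) - math.log(sigma) - v * v / (2 * sigma2) for v in r
) / len(r)

# closed form of the per-node contribution to L_N
closed = -math.log(sigma) - 0.5 - 0.5 * math.log(2 * math.pi)
assert abs(avg_loglik - closed) < 1e-9
```

The identity is exact (not just in expectation) because the quadratic term averages to $1/2$ by the definition of $\hat\sigma^2$; this is why the Gaussian score reduces to a sum of log residual standard deviations.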
Under such a normality assumption for the $E_j$ and with the $f_j(\cdot)$ in (1) being additive in their arguments,
Bühlmann et al. (2014) present a causal discovery method that is consistent for high-dimensional
data. To justify this approach for a broader class of error distributions, define
$$\Delta := \min_{G \not\supseteq G^0} \sum_{j=1}^{p} \left(\log \sigma^G_j - \log \sigma^{G^0}_j\right). \tag{2}$$
Then, one has to assume that

(A1) $\Delta > 0$.
That is, the lowest possible expected negative Gaussian log-likelihood over all graphs $G$ that
are not supersets of the true $G^0$ must be strictly larger than the expected negative Gaussian log-
likelihood under the true graph $G^0$. An argument for this assumption is that a true causal model
should be easier to fit in some sense and should thus also attain lower error variances. Using simple,
"non-pathological" examples, we demonstrate that this can easily be a fallacy when the true error
distribution is misspecified. Thus, we advocate being very careful when using Gaussian
scoring with flexible nonparametric regression functions in causal discovery. The main part of this
paper considers theoretical population properties. We discuss some data applications in Section
5.1.
2 Data-generating linear model
We consider first data-generating linear models where
$$f_j\big(X_{\mathrm{PA}(j)}\big) = \sum_{k \in \mathrm{PA}(j)} \beta_{jk} X_k.$$
For these, we establish the explicit result in Theorem 1. The intuition for this result carries over to a range
of nonlinear ANMs (1), especially if the causal effects are close to linear. We present corresponding
examples in Section 3.
If all the $E_j$ in a data-generating linear model are Gaussian, i.e., $X_1, \dots, X_p$ are jointly Gaussian,
it is known that any causal order could induce the given multivariate normal distribution and
obtain an optimal Gaussian score. Assuming that the distribution is faithful with respect to the
true DAG $G^0$, the set of sparsest DAGs obtaining the optimal Gaussian score corresponds to
the Markov equivalence class of $G^0$; see, e.g., Zhang and Spirtes (2008). Thus, one can obtain this
Markov equivalence class by preferring sparser DAGs in case of equal scores. In general, the
single true DAG cannot be determined even if the full multivariate distribution is known. In
contrast, the Linear Non-Gaussian Acyclic Model (LiNGAM) introduced in Shimizu et al. (2006) is
known to be identifiable. In such linear non-Gaussian models, algorithms designed for linear Gaussian
models, e.g., the PC algorithm using partial correlation to assess conditional independence,
do not use all the available information, but typically still provide the same guarantees since they
depend on the covariance structure only. The covariance matrix of data generated by a linear causal
model does not change when replacing a Gaussian $E_j$ by an additive error of the same variance but
otherwise arbitrary distribution. Thus, assuming the faithfulness condition for the data-generating
distribution, Gaussian scoring for linear causal models in the infinite-sample limit leads to the true
underlying Markov equivalence class even under misspecification of the error distribution.
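A minimal bivariate simulation (ours, not from the paper) illustrating this covariance-only behaviour: when both directions are fitted by ordinary least squares, the forward and backward Gaussian scores tie exactly, whatever the error distribution, because both equal $\tfrac{1}{2}\log\big(\widehat{\mathrm{Var}}(X_1)\,\widehat{\mathrm{Var}}(X_2)\,(1-\hat\rho^2)\big)$.

```python
import math
import random

random.seed(2)
n = 10_000
# true model X1 -> X2 with deliberately non-Gaussian (centered exponential) errors
x1 = [random.uniform(-1.0, 1.0) for _ in range(n)]
x2 = [0.8 * a + random.expovariate(1.0) - 1.0 for a in x1]

def log_sd(z):
    m = sum(z) / len(z)
    return 0.5 * math.log(sum((v - m) ** 2 for v in z) / len(z))

def ols_residuals(y, x):
    """Residuals of simple linear least squares of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [c - my - b * (a - mx) for a, c in zip(x, y)]

# Gaussian score (up to constants): sum of log residual standard deviations
forward = log_sd(x1) + log_sd(ols_residuals(x2, x1))   # causal order
backward = log_sd(x2) + log_sd(ols_residuals(x1, x2))  # anticausal order
assert abs(forward - backward) < 1e-9  # linear scores depend on the covariance only
```

The exact tie makes the point of this paragraph concrete: with linear regression, the Gaussian score cannot distinguish directions beyond the covariance structure, so non-Gaussian errors do no harm; the pitfall only appears once flexible regression enters.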
If the data-generating model is not known to be linear, so that nonparametric regression
methods, or the conditional mean as their population version, are applied, this generalization does
not hold anymore, as laid out in the following theorem. Let $\pi$ be a permutation on $\{1, \dots, p\}$
and $G_\pi$ the full DAG according to $\pi$, i.e., $\pi(k) \in \mathrm{PA}_{G_\pi}(\pi(j))$ if and only if $k < j$.
Theorem 1. Let $X_1, \dots, X_p$ come from a linear model
$$X_j = \sum_{k \in \mathrm{PA}(j)} \beta_{jk} X_k + E_j, \quad \text{with } \mathbb{E}[E_j] = 0 \text{ and } \mathbb{E}\big[E_j^2\big] < \infty \ \forall j \in \{1, \dots, p\}, \tag{3}$$
with mutually independent $E_j$. Then, for every possible permutation $\pi$,
$$\sum_{j=1}^{p} \left(\log \sigma^{G_\pi}_j - \log \sigma^{G^0}_j\right) \leq 0.$$
That is, for every causal order, the corresponding full graph scores at least as well as the true causal
graph.
Furthermore,
$$\sum_{j=1}^{p} \left(\log \sigma^{G_\pi}_j - \log \sigma^{G^0}_j\right) < 0 \quad \text{if } \exists j \in \{1, \dots, p\}: \ \mathbb{E}\big[X_j \mid X_{\mathrm{PA}_{G_\pi}(j)}\big] \neq X^{\top}_{\mathrm{PA}_{G_\pi}(j)} \beta_{\pi,j},$$
where $\beta_{\pi,j} = \mathbb{E}\big[X_{\mathrm{PA}_{G_\pi}(j)} X^{\top}_{\mathrm{PA}_{G_\pi}(j)}\big]^{-1} \mathbb{E}\big[X_j X_{\mathrm{PA}_{G_\pi}(j)}\big]$ is the least squares parameter.
That is, for every causal order, the corresponding full graph scores strictly better than the true
causal graph if at least one conditional expectation is nonlinear in the parental variables.
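The following toy simulation (our sketch, not the paper's experiments) illustrates the strict inequality for $p = 2$: a linear model with uniform errors of different ranges, so that $\mathbb{E}[X_1 \mid X_2]$ is piecewise linear, scored with a crude $k$-nearest-neighbour average as the nonparametric conditional-mean estimate in both directions.

```python
import math
import random

random.seed(3)
n = 20_000
# true model: X2 = X1 + E2 with uniform errors of different ranges,
# so that E[X1 | X2] is piecewise linear, i.e., nonlinear overall
x1 = [random.uniform(-1.0, 1.0) for _ in range(n)]
x2 = [a + random.uniform(-0.5, 0.5) for a in x1]

def log_sd(z):
    m = sum(z) / len(z)
    return 0.5 * math.log(sum((v - m) ** 2 for v in z) / len(z))

def knn_residuals(y, x, k=100):
    """Residuals of a crude nonparametric regression of y on x:
    fitted value = average of y over the ~2k+1 nearest x-neighbours."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ys = [y[i] for i in order]
    prefix = [0.0]
    for v in ys:
        prefix.append(prefix[-1] + v)
    res = [0.0] * len(x)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - k), min(len(x), rank + k + 1)
        res[i] = y[i] - (prefix[hi] - prefix[lo]) / (hi - lo)
    return res

forward = log_sd(x1) + log_sd(knn_residuals(x2, x1))   # true causal order
backward = log_sd(x2) + log_sd(knn_residuals(x1, x2))  # anticausal order
assert backward < forward  # the anticausal full graph scores strictly better
```

For this configuration, a short calculation gives a population backward residual variance of $1/16$ for the conditional mean versus $1/15$ for the best linear predictor; this gap of roughly $0.03$ on the log-standard-deviation scale is exactly what the Gaussian score rewards in the wrong direction.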