Independence Testing-Based Approach to Causal Discovery under Measurement Error and Linear Non-Gaussian Models

Haoyue Dai¹,²  Peter Spirtes¹  Kun Zhang¹,²
¹Department of Philosophy, Carnegie Mellon University
²Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence
hyda@cmu.edu  ps7z@andrew.cmu.edu  kunz1@cmu.edu
Abstract

Causal discovery aims to recover causal structures generating the observational data. Despite its success in certain problems, in many real-world scenarios the observed variables are not the target variables of interest, but rather imperfect measures of them. Causal discovery under measurement error aims to recover the causal graph among unobserved target variables from observations made with measurement error. We consider a specific formulation of the problem, where the unobserved target variables follow a linear non-Gaussian acyclic model, and the measurement process follows the random measurement error model. Existing methods for this formulation rely on non-scalable over-complete independent component analysis (OICA). In this work, we propose the Transformed Independent Noise (TIN) condition, which checks for independence between a specific linear transformation of some measured variables and certain other measured variables. By leveraging the non-Gaussianity and higher-order statistics of data, TIN is informative about the graph structure among the unobserved target variables. By utilizing TIN, the ordered group decomposition of the causal model is identifiable. In other words, we can achieve what once required OICA merely by conducting independence tests. Experimental results on both synthetic and real-world data demonstrate the effectiveness and reliability of our method.¹
1 Introduction
[Figure 1: Example of measurement error. Gray nodes are latent underlying variables and white nodes are observed ones.]
Discovery of causal relations is a fundamental goal of science. Identifying causal relations from observational data, known as causal discovery, has thus drawn much attention in various scientific fields, including economics, biology, and social science [45, 27, 11]. Methods for causal discovery can be roughly categorized into constraint-based ones (e.g., PC [43]), score-based ones (e.g., Greedy Equivalence Search (GES) [8]), and ones based on structural equation models (SEMs) [38, 17, 51]. Almost all these methods assume that the recorded values are values of the variables of interest, which, however, is usually not the case in real-world scenarios. Some variables may be impossible to observe or quantify, so the recorded values are actually proxies of them (e.g., measuring one's mental status by a survey questionnaire), and some variables, though quantifiable, may be subject to error introduced by instruments (e.g., measuring brain signals using functional magnetic resonance imaging (fMRI)). The difference between the quantities of interest and their measured values is termed measurement error [12].
¹An online demo and code are available at https://cmu.edu/dietrich/causality/tin.
Measurement error adversely impairs causal discovery [28, 24, 36]. The measuring process can be viewed as directed edges from the underlying variables of interest (unobservable) to the measured values (observable), and the d-separation patterns among the underlying variables typically do not hold among the measured ones. Consider the causal effects from factory emissions to air quality and then to residents' lung health, as shown in Figure 1, while we only have the corresponding measured quantities: chimney statistics, PM2.5, and hospital reported cases. Though factory emissions and lung health are independent given air quality, chimney statistics and hospital reported cases are, however, dependent given PM2.5. If the measurement error is severe, chimney statistics and hospital reported cases even tend to be marginally independent [53, 36], which makes PM2.5 look like a collider (common child). One might thus incorrectly infer that lung cancer causes air pollution. In fact, such measurement error is always a serious threat in environmental epidemiologic studies [30, 10].
Denote by $\tilde{\mathbf{X}} = \{\tilde{X}_i\}_{i=1}^n$ the latent measurement-error-free variables and $\mathbf{X} = \{X_i\}_{i=1}^n$ the observed ones. While there are different models for the measuring process [12, 6, 49], in this paper we consider the random measurement error model [36], where the observed variables are generated from the latent measurement-error-free variables $\tilde{X}_i$ with additive random measurement errors $\mathbf{E} = \{E_i\}_{i=1}^n$:

$$X_i = \tilde{X}_i + E_i. \qquad (1)$$
Measurement errors $\mathbf{E}$ are assumed to be mutually independent and independent of $\tilde{\mathbf{X}}$. We assume causal sufficiency relative to $\tilde{\mathbf{X}}$ (i.e., there is no confounder of $\tilde{\mathbf{X}}$ that does not have a respective measurement), and focus on the case where $\tilde{\mathbf{X}}$ is generated by a linear, non-Gaussian, acyclic model (LiNGAM [38], see §3.1). Note that here, w.l.o.g., the linear weights of $\{\tilde{X}_i \to X_i\}_{i=1}^n$ are assumed to be one (since we do not care about scaling). Generally, if observations are measured by $X_i = c_i \tilde{X}_i + E_i$ with weights $\{c_i\}_{i=1}^n$ not necessarily being one, all results in this paper still hold.
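To fix ideas, the following minimal sketch (not the paper's released code; all names and weight values are illustrative) simulates the random measurement error model of Equation (1) on a latent chain:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_chain(n_samples=2000, weights=(0.8, 0.9), noise_scale=0.5):
    """Latent chain X~1 -> X~2 -> ... (LiNGAM with uniform, hence non-Gaussian,
    noises), observed through X_i = X~_i + E_i as in Equation (1)."""
    n_vars = len(weights) + 1
    E_tilde = rng.uniform(-1, 1, size=(n_samples, n_vars))   # latent noises E~_i
    X_tilde = np.zeros_like(E_tilde)
    X_tilde[:, 0] = E_tilde[:, 0]
    for i, w in enumerate(weights):
        X_tilde[:, i + 1] = w * X_tilde[:, i] + E_tilde[:, i + 1]
    E = noise_scale * rng.uniform(-1, 1, size=(n_samples, n_vars))  # meas. errors
    return X_tilde + E, X_tilde   # observed X, latent X~

X, X_tilde = simulate_chain(weights=(0.8, 0.9, 0.7, 0.6))   # five variables
```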
The objective of causal discovery under measurement error is to recover the causal structure among the latent variables $\tilde{\mathbf{X}}$, denoted as $\tilde{G}$, a directed acyclic graph (DAG), from the contaminated observations $\mathbf{X}$. As illustrated by Figure 1, causal discovery methods that utilize (conditional) independence produce biased estimations (see Proposition 1 for details). SEM-based methods also typically fail to find correct directions, since the SEM for $\tilde{\mathbf{X}}$ usually does not hold on $\mathbf{X}$. The unobserved $\tilde{\mathbf{X}}$ are actually confounders of $\mathbf{X}$, and there exist approaches to deal with confounders, such as Fast Causal Inference (FCI [44]). However, they focus on the structure among the observed variables instead of the unobserved ones, which is what we aim to recover here. With interest in the latter, another line of research called causal discovery with latent variables has been developed, to which this paper also belongs. However, existing methods [41, 42, 47, 46, 1, 23, 50, 35] cannot be adopted either, since they typically require at least two measurements (indicators) for each latent variable, while we only have one for each here (which makes ours a more difficult task). Specifically on the measurement error problem, [16] proposes anchored causal inference in the binary setting. In the linear Gaussian setting, [53] presents identifiability conditions via factor analysis. A main difficulty here is the unknown variances of the measurement errors $\mathbf{E}$; otherwise, the covariance matrix of $\tilde{\mathbf{X}}$ could be obtained and readily used. To this end, [2] provides an upper bound on $\mathbf{E}$ and [34] develops a consistent partial-correlations estimator. In linear non-Gaussian settings (i.e., the setting of this paper), [54] shows that the ordered group decomposition of $\tilde{G}$, which contains major causal information, is identifiable. However, the corresponding method relies on over-complete independent component analysis (OICA [19]), which is notorious for suffering from local optima and high computational complexity [39, 18]. Hence, the identifiability results in [54], despite their theoretical correctness, are far from practically achievable.
The main contributions of this paper are as follows: 1) We define the Transformed Independent Noise (TIN) condition, which finds and checks for independence between a specific linear transformation (combination) of some variables and others. The existing Independent Noise (IN [40]) and Generalized Independent Noise (GIN [50]) conditions are special cases of TIN. 2) We provide graphical criteria for TIN, which might further improve the identifiability of causal discovery with latent variables. 3) We exploit TIN on a specific task, causal discovery under measurement error and LiNGAM, and identify the ordered group decomposition. This identifiability result once required the computationally and statistically ineffective OICA to achieve, while we achieve it merely by conducting independence tests. Evaluation on both synthetic and real-world data demonstrates the effectiveness of our method.
2 Motivation: Independence Condition and Structural Information

The example in Figure 1 illustrates how the (conditional) (in)dependence relations differ between the observed $\mathbf{X}$ and the latent $\tilde{\mathbf{X}}$, and thus lead to biased discovery results. To put it generally, we have:
Proposition 1 (rare d-separation). Suppose variables follow the random measurement error model defined in Equation (1). For disjoint sets of observed variables $\mathbf{Z}, \mathbf{Y}, \mathbf{S}$ and their respective latent ones $\tilde{\mathbf{Z}}, \tilde{\mathbf{Y}}, \tilde{\mathbf{S}}$, the d-separation $\mathbf{Z} \perp_d \mathbf{Y} \mid \mathbf{S}$ holds only when marginally $\tilde{\mathbf{Z}} \perp_d \tilde{\mathbf{Y}}$ and $\tilde{\mathbf{Z}} \perp_d \tilde{\mathbf{Y}} \mid \tilde{\mathbf{S}}$ hold.
By 'rare' we mean that the d-separation patterns among $\tilde{\mathbf{X}}$ usually do not hold among $\mathbf{X}$ (except for rare marginal ones), since the observed variables are not causes of any others (though the latent variables they intend to measure might be). For example, consider the underlying $\tilde{G}$ to be the chain structure (Figure 2a) or the fully connected DAG (Figure 2c). There exists no (conditional) independence on either, and the PC algorithm will output just a fully connected skeleton in both cases. Then, without (conditional) independence (which is non-parametric) to use directly, can we create independence by leveraging the parametric assumption (LiNGAM) and benefiting from non-Gaussianity?
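Before turning to that question, a minimal numeric illustration of the failure (a sketch reusing simulate_chain from above; the helper partial_corr is ours, not a prescribed procedure): the partial correlation of $X_1$ and $X_3$ given $X_2$ vanishes on the latent chain but not on the measured one, so a PC-style test sees a spurious edge.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z (1-D arrays)."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

X, X_tilde = simulate_chain(weights=(0.8, 0.9))  # three-variable chain
print(partial_corr(X_tilde[:, 0], X_tilde[:, 2], X_tilde[:, 1]))  # approx. 0
print(partial_corr(X[:, 0], X[:, 2], X[:, 1]))   # clearly nonzero
```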
Naturally we recall the Independent Noise (IN) condition proposed in Direct-LiNGAM [40]:
Definition 1 (IN condition). Let $Y_i$ be a single variable and $\mathbf{Z}$ be a set of variables. Suppose variables follow LiNGAM. We say $(\mathbf{Z}, Y_i)$ satisfies the IN condition, denoted by $\mathrm{IN}(\mathbf{Z}, Y_i)$, if and only if the residual of regressing $Y_i$ on $\mathbf{Z}$ is statistically independent of $\mathbf{Z}$. Mathematically, let $\tilde{\omega}$ be the vector of regression coefficients, i.e., $\tilde{\omega} := \mathrm{cov}(Y_i, \mathbf{Z})\,\mathrm{cov}(\mathbf{Z}, \mathbf{Z})^{-1}$; $\mathrm{IN}(\mathbf{Z}, Y_i)$ holds iff $Y_i - \tilde{\omega}^{\top}\mathbf{Z} \perp \mathbf{Z}$.
Here "cov" denotes the variance-covariance matrix. IN identifies exogenous (root) variables, based on which the causal ordering of the variables can be determined (Lemma 1 in [40]). However, IN cannot be applied to the measurement error model. With hidden confounders ($\tilde{\mathbf{X}}$) behind the observed $\mathbf{X}$, independence between regressor and residual typically does not exist for any regression among $\mathbf{X}$. In fact, $\mathbf{X}$ follows errors-in-variables models [15, 7], for which the identifiability w.r.t. $\tilde{G}$ is not clear.
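A minimal sketch of testing the IN condition, assuming distance correlation as a surrogate independence measure (the paper does not prescribe a particular test; in practice one would use a calibrated test such as HSIC or KCI):

```python
import numpy as np

def dist_corr(x, y):
    """Biased sample distance correlation of two 1-D arrays: near 0 under
    independence (up to sampling noise). O(n^2) memory, so keep n moderate."""
    def centered(a):
        d = np.abs(a[:, None] - a[None, :])
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = max((A * B).mean(), 0.0)
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

def in_condition(Z, y, thresh=0.1):
    """IN(Z, y) of Definition 1: the OLS residual of y on the columns of Z
    should be independent of Z (small distance correlation to every column)."""
    Zc, yc = Z - Z.mean(axis=0), y - y.mean()
    omega = np.linalg.lstsq(Zc, yc, rcond=None)[0]
    resid = yc - Zc @ omega
    return all(dist_corr(resid, Zc[:, j]) < thresh for j in range(Z.shape[1]))

# With measurement error, IN fails even where it would hold on the latent chain,
# e.g. in_condition(X[:, [0]], X[:, 1]). Note the residual-regressor dependence
# is higher-order only (the residual is uncorrelated with Z by construction),
# so detecting it reliably needs a calibrated test and enough samples; the
# fixed threshold here is only a crude stand-in.
```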
However, we might still benefit from this idea to leverage the non-Gaussianity of exogenous noises. Consider the Figure 1 example and abstract it to $\tilde{X}_1 \xrightarrow{a} \tilde{X}_2 \xrightarrow{b} \tilde{X}_3$ with $\{\tilde{X}_i \xrightarrow{1} X_i\}_{i=1}^3$. Although IN does not hold on any of $\mathbf{X}$, interestingly, there exists a linear transformation of observations, $bX_2 - X_3$, which contains only $\{\tilde{E}_3, E_2, E_3\}$ ($\tilde{E}_1$ and $\tilde{E}_2$ are cancelled out) and shares no common non-Gaussian noise term with $X_1$. Hence, by the Darmois–Skitovich theorem [21], $bX_2 - X_3$ is independent of $X_1$. This finding is echoed in the Generalized Independent Noise (GIN [50]) condition:
Definition 2 (GIN condition). Let $\mathbf{Z}$ and $\mathbf{Y}$ be two sets of random variables that follow LiNGAM. We say $(\mathbf{Z}, \mathbf{Y})$ satisfies the GIN condition, denoted by $\mathrm{GIN}(\mathbf{Z}, \mathbf{Y})$, if and only if the following two conditions are satisfied: 1) there exist nonzero solution vectors $\omega \in \mathbb{R}^{|\mathbf{Y}|}$ to the equation $\mathrm{cov}(\mathbf{Z}, \mathbf{Y})\,\omega = 0$, and 2) any such solution $\omega$ makes the linear transformation $\omega^{\top}\mathbf{Y}$ independent of $\mathbf{Z}$.
Here $|\mathbf{Y}|$ denotes the dimensionality of $\mathbf{Y}$. The intuition behind GIN is that, despite there being no independent residual from ordinary regression, it is possible to realize independent "pseudo-residuals" [5] by regressing with "reference variables". [50] shows IN to be a special case of GIN (Proposition 2) and further gives graphical criteria (Theorem 2), based on which a recursive learning algorithm is developed to solve the latent-variable problem. Each latent variable is required to have at least two measurements. Interestingly, we find that, in measurement error models, if each latent variable $\tilde{X}_i$ has two measurements $X_{i1}, X_{i2}$, then the GIN condition can readily be used to fully identify the structure of $\tilde{G}$, which is already a breakthrough over existing methods [42, 47, 46, 23]. See Appendix B for the whole procedure. With regard to our more challenging task, where each $\tilde{X}_i$ has only one measurement $X_i$, a natural question is: can GIN also help? Given the example in Figure 1 illustrated above, the answer seems to be affirmative: $\mathrm{GIN}(\{X_i\}, \mathbf{X}\setminus\{X_i\})$ only holds for $i = 1$, so the root can be identified. More generally:
[Figure 2: Graph structure $\tilde{G}$ examples: (a) a chain over $\tilde{X}_1, \cdots, \tilde{X}_5$ with edge weights $a, b, c, d$; (b) a chain with triangular head over $\tilde{X}_1, \cdots, \tilde{X}_5$ with edge weights $a, b, c, d, e$; (c) a fully connected DAG over $\tilde{X}_1, \cdots, \tilde{X}_4$ with edge weights $a, \cdots, f$; (d) a graph over $\tilde{X}_1, \cdots, \tilde{X}_5$. For simplicity, the observed $\mathbf{X}$ and measurement edges are omitted.]
Example 1 (GIN on chain structure). Consider cases where the underlying graph $\tilde{G}$ is a chain structure with $n$ ($n \ge 3$) vertices and directed edges $\{\tilde{X}_i \to \tilde{X}_{i+1}\}_{i=1}^{n-1}$. Figure 2a is an example with $n = 5$. We find that $\mathrm{GIN}(\mathbf{Z} = X_1, \mathbf{Y} = X_{2,3,4,5})$ holds (where $X_{2,3,4,5}$ denotes $\{X_2, X_3, X_4, X_5\}$; same below), with solution $\omega = [-bx - bcy - bcdz, x, y, z]^{\top}$, $x, y, z \in \mathbb{R}$. $\omega^{\top}\mathbf{Y}$ cancels the noise components in $\mathbf{Y}$ that are also shared by $\mathbf{Z}$, and thus $\omega^{\top}\mathbf{Y} \perp \mathbf{Z}$. However, $\mathrm{GIN}(X_i, \mathbf{X}\setminus\{X_i\})$ is violated for any other $i \neq 1$ (see Example 12 for a detailed derivation). With this asymmetry, the latent root $\tilde{X}_1$ can be identified. Furthermore, $\mathrm{GIN}(X_i, X_{i+1,\cdots,n})$ holds for any $i = 1, \cdots, n-2$.
Example 1 might give us the intuition that, by recursively testing GIN (over the newly found sub-root and the remaining variables), we could identify the causal ordering of the first $n-2$ variables for any DAG. However, this is over-optimistic: the success owes to the sparsity of the chain structure. Consider a denser structure:
Example 2 (GIN on fully connected DAG). Consider cases where $\tilde{G}$ is a fully connected DAG with $n$ ($n \ge 3$) vertices and directed edges $\{\tilde{X}_i \to \tilde{X}_j\}$ for every $i < j$. Figure 2c is an example with $n = 4$. We first find that $\mathrm{GIN}(X_1, X_{2,3,4})$ holds. However, in contrast to the chain structure, $\mathrm{GIN}(X_2, X_{3,4})$ does not hold: there is no way to cancel both $\tilde{E}_{1,2}$ from $X_{3,4}$. More generally, we have $\mathrm{GIN}(X_1, X_{2,\cdots,n}), \mathrm{GIN}(X_{1,2}, X_{3,\cdots,n}), \cdots, \mathrm{GIN}(X_{1,\cdots,k}, X_{k+1,\cdots,n})$, with $k = \lfloor (n-1)/2 \rfloor$.
Since a fully connected DAG is the densest extreme, Example 2 might give us the intuition that GIN can identify at least the causal ordering of the first half of the variables. Unfortunately, this is still over-optimistic, since we cannot know beforehand the structure type of $\tilde{G}$.
Example 3 (GIN on chain structure with triangular head). Figure 2b shows a variation of the chain structure, with edges $\tilde{X}_1 \to \tilde{X}_2$, $\tilde{X}_1 \to \tilde{X}_3$, $\tilde{X}_3 \to \tilde{X}_2$, and $\{\tilde{X}_i \to \tilde{X}_{i+1}\}_{i=3}^{n-1}$. We name it the "chain structure with triangular head". Interestingly, $\mathrm{GIN}(X_2, X_{3,\cdots,n})$, which is satisfied on Figure 2a, is also satisfied here. E.g., in Figure 2b ($n = 5$), the solution is $\omega = [-dx - dey, x, y]^{\top}$, $x, y \in \mathbb{R}$: $\omega^{\top}\mathbf{Y}$ cancels $\tilde{E}_{1,3}$ from $\mathbf{Y}$, and thus $\omega^{\top}\mathbf{Y} \perp \mathbf{Z}$. Actually, the chain structure with triangular head $\tilde{G}_{OC}$ is unidentifiable from the chain structure $\tilde{G}_C$ w.r.t. GIN conditions, i.e., for any two sets of observed variables $\mathbf{Z}, \mathbf{Y} \subseteq \mathbf{X}$, $\mathrm{GIN}(\mathbf{Z}, \mathbf{Y})$ holds on $\tilde{G}_{OC}$ if and only if $\mathrm{GIN}(\mathbf{Z}, \mathbf{Y})$ holds on $\tilde{G}_C$. Consequently, if one directly uses any recursive algorithm based on GIN as in Example 1, the output causal ordering would still be $\tilde{X}_1, \tilde{X}_2, \tilde{X}_3, \cdots$, which is incorrect due to $\tilde{X}_3 \to \tilde{X}_2$ in the triangular head.
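As a numeric check of this unidentifiability (a sketch reusing gin_holds and dist_corr from above; simulate_from_A and the weight values are illustrative):

```python
import numpy as np

def simulate_from_A(A, n_samples=2000, noise_scale=0.5, seed=1):
    """Measured LiNGAM: latent X~ = (I - A)^{-1} E~ with uniform noises,
    observed as X = X~ + E (Equation (1))."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    E_tilde = rng.uniform(-1, 1, size=(n_samples, n))
    X_tilde = E_tilde @ np.linalg.inv(np.eye(n) - A).T
    return X_tilde + noise_scale * rng.uniform(-1, 1, size=(n_samples, n))

# Chain with triangular head (Figure 2b): X~1->X~2 (a), X~3->X~2 (b),
# X~1->X~3 (c), X~3->X~4 (d), X~4->X~5 (e).
a, b, c, d, e = 0.8, 0.9, 0.7, 0.6, 0.5
A_oc = np.zeros((5, 5))
A_oc[1, 0], A_oc[1, 2], A_oc[2, 0], A_oc[3, 2], A_oc[4, 3] = a, b, c, d, e
X_oc = simulate_from_A(A_oc)
print(gin_holds(X_oc[:, [1]], X_oc[:, 2:]))  # True here too: GIN cannot tell
                                             # this graph from the plain chain
```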
The above examples show that using GIN for the one-measurement model is not as simple as it seems: only part of the causal ordering can be identified, and worse yet, rather complicated error correction is needed to deal with possibly incorrect orderings. However, after a closer look at the unidentifiable examples above, we find that more information can actually be uncovered beyond the GIN condition:
Example 4 (Asymmetry beyond GIN). 1) Consider Example 2, where only the root variable $\tilde{X}_1$ can be identified by GIN and $\tilde{X}_{2,3,4}$ are unidentifiable, i.e., any permutation of the labeling of $X_{2,3,4}$ will still preserve the GIN condition over any two subsets $\mathbf{Z}, \mathbf{Y}$. However, there actually exists an asymmetry between $X_2$ and $X_{3,4}$: we can construct a linear transformation of $X_{1,3,4}$, namely $\frac{cd-be}{d}X_1 + \frac{df+e}{d}X_3 - X_4$, s.t. it is independent of $X_2$, while there exists no such linear transformation of $X_{1,2,4}$ that is independent of $X_3$, and likewise none of $X_{1,2,3}$ for $X_4$. 2) Consider Examples 1 and 3, where $\tilde{G}_C$ and $\tilde{G}_{OC}$ are unidentifiable w.r.t. GIN conditions. Let $\mathbf{Z} := X_4$ and $\mathbf{Y} := X_{1,2,3}$; $\mathrm{GIN}(\mathbf{Z}, \mathbf{Y})$ is violated on both graphs. However, an asymmetry actually exists: on $\tilde{G}_{OC}$, we can construct $aX_1 - X_2 + bX_3$ (which cancels $\tilde{E}_{1,3}$) to be independent of $X_4$, while this is impossible on $\tilde{G}_C$.
Put simply, the motivation of independent "pseudo-residuals" behind GIN actually limits the power of non-Gaussianity, with the coefficient vector $\omega$ characterized only from the variance-covariance matrix (2nd-order statistics). There are actually two cases in which $\mathrm{GIN}(\mathbf{Z}, \mathbf{Y})$ is violated: 1) though not every solution $\omega$ makes $\omega^{\top}\mathbf{Y} \perp \mathbf{Z}$, there exists a non-zero $\omega$ s.t. $\omega^{\top}\mathbf{Y} \perp \mathbf{Z}$; and 2) there naturally exists no non-zero $\omega$ s.t. $\omega^{\top}\mathbf{Y} \perp \mathbf{Z}$. The original GIN cannot distinguish between these two cases. Hence, in this paper, we first aim to distinguish between the two, generalizing the GIN condition to the TIN condition.
3 With Transformed Independent Noise Condition
In the above discussion, one can see that the presence of measurement error affects the conditional independence relations among the variables and the independent noise condition. However, we will show in Theorem 3 (§4) that a specific type of independence condition is shared between the underlying error-free variables and the measured variables with error. In this section, we formulate such independence conditions and investigate their graphical implications for error-free variables (i.e., variables generated by the LiNGAM without measurement error). In §4, we will then extend the results to the measured variables. Please note that, in contrast to other sections, the notation used in this section, including $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{S}$, denotes error-free variables generated by the LiNGAM.
3.1 Notations
Let $G$ be a directed acyclic graph with vertex set $V(G) = [n] := \{1, 2, \cdots, n\}$ and edge set $E(G)$. A directed path $P = (i_0, i_1, \cdots, i_k)$ in $G$ is a sequence of vertices of $G$ where there is a directed edge from $i_j$ to $i_{j+1}$ for any $0 \le j \le k-1$. We use the notation $i \rightsquigarrow j$ to denote that there exists a directed path from vertex $i$ to $j$. Note that a single vertex is also a directed path, i.e., $i \rightsquigarrow i$ holds. Let $\mathbf{Z} \subseteq [n]$ be a subset of vertices. Define ancestors $\mathrm{Anc}(\mathbf{Z}) := \{j \mid \exists i \in \mathbf{Z}, j \rightsquigarrow i\}$. Note that $\mathbf{Z} \subseteq \mathrm{Anc}(\mathbf{Z})$. Further, let $\mathbf{S}$ be a subset of vertices. We use the notation $i \rightsquigarrow_{[\mathbf{S}]} j$ to denote that there exists a directed path from vertex $i$ to $j$ without passing through $\mathbf{S}$, i.e., there exists a directed path $P = (i, m_0, \cdots, m_k, j)$ in $G$ s.t. $i, j \notin \mathbf{S}$ and $m_l \notin \mathbf{S}$ for any $0 \le l \le k$. Define ancestors outside $\mathbf{S}$ accordingly: for two vertex sets $\mathbf{Y}, \mathbf{S}$, denote the ancestors of $\mathbf{Y}$ that have directed paths into $\mathbf{Y}$ without passing through $\mathbf{S}$ as $\mathrm{Anc}_{\mathrm{out}(\mathbf{S})}(\mathbf{Y}) := \{j \mid \exists i \in \mathbf{Y}, j \rightsquigarrow_{[\mathbf{S}]} i\}$. Note that the graphical definitions here can also be translated into the trek [47] language (see Appendix C for details).
Assume random variables $\mathbf{X} := \{X_i\}_{i \in [n]}$ are generated by a LiNGAM w.r.t. graph $G$, i.e.,

$$\mathbf{X} = \mathbf{A}\mathbf{X} + \mathbf{E} = \mathbf{B}\mathbf{E}, \quad \text{with } \mathbf{B} = (\mathbf{I} - \mathbf{A})^{-1}, \qquad (2)$$

where $\mathbf{E} = \{E_i\}_{i \in [n]}$ are the corresponding mutually independent exogenous noises. $\mathbf{A}$ is the adjacency matrix, where entry $\mathbf{A}_{j,i}$ is the linear weight of the direct causal effect of variable $X_i$ on $X_j$; $\mathbf{A}_{j,i} \neq 0$ if and only if there exists an edge $i \to j$. $\mathbf{X}$ can also be written directly as a mixture of exogenous noises, $\mathbf{X} = \mathbf{B}\mathbf{E}$. If the entry of the mixing matrix $\mathbf{B}_{j,i} \neq 0$, then $i \in \mathrm{Anc}(\{j\})$. Note that here and in what follows, we use boldface letters $\mathbf{A}, \mathbf{B}$ to denote matrices, and use boldface letters $\mathbf{S}, \mathbf{W}, \mathbf{X}, \mathbf{Y}, \mathbf{Z}$ with notation abuse: they can denote a set of vertices, the respective set of random variables, or a random vector. When we say "two variable sets $\mathbf{Z}, \mathbf{Y}$", if not otherwise specified, $\mathbf{Z}, \mathbf{Y}$ need not be disjoint.
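A small sketch of these graphical notions, assuming the DAG is given by its weighted adjacency matrix A (with A[j, i] != 0 iff edge i -> j; 0-based indices):

```python
import numpy as np

def ancestors(A, Z, S=()):
    """Anc_out(S)(Z): vertices j with a directed path j ~> i into some i in Z
    that avoids S; with S empty this is plain Anc(Z). Each vertex is a
    (trivial) path to itself, so the vertices of Z outside S are included."""
    anc = set(Z) - set(S)
    stack = list(anc)
    while stack:
        j = stack.pop()
        for i in np.nonzero(A[j, :])[0]:   # parents i of j (edge i -> j)
            if i not in anc and i not in S:
                anc.add(int(i))
                stack.append(int(i))
    return anc

def mixing_matrix(A):
    """B = (I - A)^{-1} of Equation (2); B[j, i] != 0 implies i in Anc({j})."""
    return np.linalg.inv(np.eye(A.shape[0]) - A)
```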
3.2 Independent Linear Transformation Subspace and its Characterization
We first give the definition and characterization of the independent linear transformation subspace.
Definition 3 (Independent linear transformation subspace). Let $\mathbf{Z}$ and $\mathbf{Y}$ be two subsets of random variables. Suppose the variables follow the linear non-Gaussian acyclic causal model. Denote:

$$\Omega_{\mathbf{Z};\mathbf{Y}} := \{\omega \in \mathbb{R}^{|\mathbf{Y}|} \mid \omega^{\top}\mathbf{Y} \perp \mathbf{Z}\}. \qquad (3)$$

By the property of independence, $\Omega_{\mathbf{Z};\mathbf{Y}}$ is closed under scalar multiplication and addition, and is thus a subspace of $\mathbb{R}^{|\mathbf{Y}|}$. In fact, $\Omega_{\mathbf{Z};\mathbf{Y}}$ can be characterized as a nullspace as follows:
Theorem 1 (Characterization of $\Omega_{\mathbf{Z};\mathbf{Y}}$). For two variable subsets $\mathbf{Z}$ and $\mathbf{Y}$, $\Omega_{\mathbf{Z};\mathbf{Y}}$ satisfies:

$$\Omega_{\mathbf{Z};\mathbf{Y}} = \mathrm{null}\big(\mathbf{B}^{\top}_{\mathbf{Y}, \mathrm{nzcol}(\mathbf{B}_{\mathbf{Z},:})}\big), \qquad (4)$$

where $\mathrm{null}(\cdot)$ denotes the nullspace. $\mathbf{B}_{\mathbf{Y}, \mathrm{nzcol}(\mathbf{B}_{\mathbf{Z},:})}$ denotes the submatrix of the mixing matrix $\mathbf{B}$ with rows indexed by $\mathbf{Y}$ and columns indexed by $\mathrm{nzcol}(\mathbf{B}_{\mathbf{Z},:})$, and $\mathrm{nzcol}(\mathbf{B}_{\mathbf{Z},:})$ denotes the column indices where the submatrix $\mathbf{B}_{\mathbf{Z},:}$ has non-zero entries. $\mathrm{nzcol}(\mathbf{B}_{\mathbf{Z},:})$ corresponds exactly to the exogenous noises that constitute $\mathbf{Z}$. In particular, if we assume "if $i \rightsquigarrow j$ then $\mathbf{B}_{j,i} \neq 0$", then $\mathrm{nzcol}(\mathbf{B}_{\mathbf{Z},:}) = \mathrm{Anc}(\mathbf{Z})$.
The proof of Theorem 1 is straightforward by the Darmois–Skitovich theorem [21]: the linear transformation $\omega^{\top}\mathbf{Y} \perp \mathbf{Z}$ if and only if $\omega^{\top}\mathbf{Y}$ shares no common non-Gaussian exogenous noise component with $\mathbf{Z}$.
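Given (oracle) access to the mixing matrix $\mathbf{B}$, Theorem 1's characterization is a one-line nullspace computation; a sketch (0-based indices; the chain weights are illustrative):

```python
import numpy as np
from scipy.linalg import null_space

def omega_basis(B, Z, Y, tol=1e-9):
    """Basis (as columns) of Omega_{Z;Y} = null(B_{Y, nzcol(B_{Z,:})}^T)."""
    nz = np.nonzero(np.any(np.abs(B[Z, :]) > tol, axis=0))[0]  # nzcol(B_{Z,:})
    return null_space(B[np.ix_(Y, nz)].T)

# Chain of Figure 2a (left matrix of Equation (5)) with weights a, b, c, d.
a, b, c, d = 0.8, 0.9, 0.7, 0.6
A = np.zeros((5, 5))
A[1, 0], A[2, 1], A[3, 2], A[4, 3] = a, b, c, d
B = np.linalg.inv(np.eye(5) - A)
W = omega_basis(B, Z=[0], Y=[1, 2, 3, 4])    # Omega_{X1; X_{2,3,4,5}}
print(W.shape[1])  # 3: matching the three-parameter omega family of Example 1
```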
Example 5 (Revisiting the examples in §2 from the $\Omega_{\mathbf{Z};\mathbf{Y}}$ perspective). For illustration, we now revisit the examples in §2 from the perspective of the independent linear transformation subspace.

$$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ a & 1 & 0 & 0 & 0 \\ ab & b & 1 & 0 & 0 \\ abc & bc & c & 1 & 0 \\ abcd & bcd & cd & d & 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ a+bc & 1 & b & 0 & 0 \\ c & 0 & 1 & 0 & 0 \\ cd & 0 & d & 1 & 0 \\ cde & 0 & de & e & 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 & 0 & 0 \\ a & 1 & 0 & 0 \\ ad+b & d & 1 & 0 \\ a(df+e)+bf+c & df+e & f & 1 \end{bmatrix} \qquad (5)$$

Equation (5) shows the corresponding mixing matrices $\mathbf{B}$ for the graphs $\tilde{G}$ in Figures 2a to 2c, respectively. Suppose we have access to the underlying variables $\tilde{X}_i$ and focus only on $\tilde{G}$. In what follows, subscripted index sets denote submatrices of these $\mathbf{B}$. 1) For the fully connected DAG (Figure 2c, the right matrix), to identify the root $\tilde{X}_1$, $\mathrm{GIN}(\tilde{X}_1, \tilde{X}_{2,3,4})$ is satisfied, corresponding to $\Omega_{\mathbf{Z};\mathbf{Y}} = \mathrm{null}(\mathbf{B}^{\top}_{\{2,3,4\},\{1\}})$. For $(\mathbf{Z}, \mathbf{Y}) := (\tilde{X}_2, \tilde{X}_{3,4})$ or $(\tilde{X}_4, \tilde{X}_{1,2,3})$, there exists no non-zero $\omega$ s.t. $\omega^{\top}\mathbf{Y} \perp \mathbf{Z}$, because the submatrices $\mathbf{B}_{\{3,4\},\{1,2\}}$ and $\mathbf{B}_{\{1,2,3\},\{1,2,3,4\}}$ have full row rank. However, if we set $(\mathbf{Z}, \mathbf{Y}) := (\tilde{X}_2, \tilde{X}_{1,3,4})$, we actually have $\tilde{X}_2 \perp \frac{cd-be}{d}\tilde{X}_1 + \frac{df+e}{d}\tilde{X}_3 - \tilde{X}_4$, because the stacked submatrix $\mathbf{B}_{\{1,3,4\},\{1,2\}}$ has rank $2 < 3$; GIN is nevertheless still violated, because $\mathrm{cov}(\mathbf{Z}, \mathbf{Y})$ has rank $1 < 2$, so its nullspace strictly contains $\Omega_{\mathbf{Z};\mathbf{Y}}$ and condition 2) of Definition 2 fails. 2) For the chain structure $\tilde{G}_C$ (Figure 2a,