RHINO: DEEP CAUSAL TEMPORAL RELATIONSHIP
LEARNING WITH HISTORY-DEPENDENT NOISE
Wenbo Gong, Joel Jennings, Cheng Zhang & Nick Pawlowski
Microsoft Research
Cambridge, UK
{t-gongwenbo, joeljennings, cheng.zhang, nick.pawlowski}@microsoft.com
ABSTRACT
Discovering causal relationships between different variables from time series data has been a long-standing challenge for many domains such as climate science, finance and healthcare. Given the complexity of real-world relationships and the nature of observations in discrete time, causal discovery methods need to consider non-linear relations between variables, instantaneous effects and history-dependent noise (the change of noise distribution due to past actions). However, previous works do not offer a solution addressing all these problems together. In this paper, we propose a novel causal relationship learning framework for time-series data, called Rhino, which combines vector auto-regression, deep learning and variational inference to model non-linear relationships with instantaneous effects while allowing the noise distribution to be modulated by historical observations. Theoretically, we prove the structural identifiability of Rhino. Our empirical results from extensive synthetic experiments and two real-world benchmarks demonstrate better discovery performance compared to relevant baselines, with ablation studies revealing its robustness under model misspecification.
1 INTRODUCTION
Time series data is a collection of data points recorded at different timestamps describing a pattern
of chronological change. Identifying the causal relations between different variables and their in-
teractions through time (Spirtes et al., 2000; Berzuini et al., 2012; Guo et al., 2020; Peters et al.,
2017) is essential for many applications, e.g. climate science and health care. Randomized controlled trials are the gold standard for discovering such relationships, but may be unavailable due to cost
and ethical constraints. Therefore, causal discovery with just observational data is important and
fundamental to many real-world applications (Löwe et al., 2022; Bussmann et al., 2021; Moraffah et al., 2021; Wu et al., 2020; Runge, 2018; Tank et al., 2018; Hyvärinen et al., 2010; Pamfil et al., 2020).
The task of temporal causal discovery is challenging for several reasons: (1) relations between variables can be non-linear in the real world; (2) with a slow sampling interval, everything that happens in between is aggregated into the same timestamp, i.e. instantaneous effects; (3) the noise may be non-stationary (its distribution depends on past observations), i.e. history-dependent noise. For example, in stock markets, the announcement of a decision by a leading company after the market closes may have complex effects (non-linearity) on its stock price immediately after the market opens (slow sampling interval and instantaneous effect), and its price volatility may also change (history-dependent noise). Similarly, in education, students who recently earned good marks on algebra tests should also score well on an upcoming algebra exam with little variation (history-dependent noise).
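To make the notion of history-dependent noise concrete, the sketch below simulates a toy AR(1) process whose noise scale grows with the magnitude of the previous observation; all coefficients here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T=10000, history_dependent=True):
    # AR(1) process: x_t = 0.5 * x_{t-1} + sigma_t * eps_t.
    # With history-dependent noise, the noise scale sigma_t grows
    # with the magnitude of the previous observation (illustrative choice).
    x = np.zeros(T)
    for t in range(1, T):
        sigma = 0.1 + 0.5 * abs(x[t - 1]) if history_dependent else 0.3
        x[t] = 0.5 * x[t - 1] + sigma * rng.normal()
    return x

x_hist = simulate(history_dependent=True)
x_fix = simulate(history_dependent=False)
# Under history-dependent noise, the conditional variance of x_t varies
# with the past: the residual spread after a large |x_{t-1}| exceeds the
# spread after a small |x_{t-1}|.
```

Conditioning the residuals `x[t] - 0.5 * x[t-1]` on the magnitude of the previous value exposes the heteroscedasticity that a fixed-noise model cannot capture.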
To the best of our knowledge, the performance of existing frameworks suffers in many real-world scenarios because they cannot address these aspects in a satisfactory way. In particular, history-dependent noise has rarely been considered in the past. A large category of preceding works, called Granger causality (Granger, 1969), is based on the fact that cause-effect relationships can never go against time. Despite many recent advances (Wu et al., 2020; Shojaie & Michailidis, 2010; Siggiridou & Kugiumtzis, 2015; Amornbunchornvej et al., 2019; Löwe et al., 2022; Tank et al., 2018; Bussmann et al., 2021; Dang et al., 2018; Xu et al., 2019), they all rely on the absence of instantaneous effects and assume a fixed noise distribution. Constraint-based methods have also been extended for time-series causal discovery (Runge, 2018; 2020), commonly by folding the time series. This introduces new assumptions and translates the aforementioned requirements into challenges in conditional independence testing (Shah & Peters, 2020). Additionally, they require a stronger faithfulness assumption and can only identify the causal graph up to a Markov equivalence class without detailed functional relationships.

arXiv:2210.14706v1 [cs.LG] 26 Oct 2022
An alternative line of research leverages causal discovery with functional causal models (Hyvärinen et al., 2010; Pamfil et al., 2020; Peters et al., 2013). These can model both instantaneous and lagged effects as long as they have theoretically guaranteed structural identifiability. Unfortunately, they do not consider history-dependent noise. One central challenge in modelling this dependency is that noise depending on the lagged parents may break the model's structural identifiability. For static data, Khemakhem et al. (2021) prove structural identifiability only when this dependency is restricted to a simple functional form. Thus, the key research question is whether identifiability can be preserved with complex historical dependencies in the temporal setting.
Motivated by these requirements, we propose a novel temporal causal discovery method called Rhino (deep causal temporal relationship learning with history-dependent noise), which can model non-linear lagged and instantaneous effects with flexible history-dependent noise. Our contributions are:
• A novel causal discovery framework called Rhino, which combines vector auto-regression and deep learning to model non-linear lagged and instantaneous effects with history-dependent noise. We also propose a principled training framework using variational inference.
• We prove that Rhino is structurally identifiable. To achieve this, we provide general conditions for structural identifiability with history-dependent noise, of which Rhino is a special case. Furthermore, we clarify relations to several previous works.
• We conduct extensive synthetic experiments with ablation studies to demonstrate the advantages of Rhino and its robustness under model misspecification. Additionally, we compare its performance to a wide range of baselines on two real-world discovery benchmarks.
2 BACKGROUND
In this section, we briefly introduce necessary preliminaries for Rhino. In particular, we focus on
structural equation models, Granger causality (Granger, 1969) and vector auto-regression.
Structural Equation Models (SEMs)   Consider $X \in \mathbb{R}^D$ with $D$ variables; an SEM describes the causal relationships between them given a causal graph $G$:
$$X_i = f_i(\mathrm{Pa}_G^i, \epsilon_i) \quad (1)$$
where $\mathrm{Pa}_G^i$ are the parents of node $i$ and the $\epsilon_i$ are mutually independent noise variables. In the context of a multivariate time series, $X_t = \{X_t^i\}_{i \in V}$ where $V$ is a set of nodes with size $D$, the corresponding SEM given a temporal causal graph $G$ is
$$X_t^i = f_{i,t}(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t), \epsilon_t^i), \quad (2)$$
where $\mathrm{Pa}_G^i(<t)$ contains the parent values specified by $G$ at previous time steps (lagged parents) and $\mathrm{Pa}_G^i(t)$ are the parents at the current time $t$ (instantaneous parents). The above SEM induces a joint distribution over the stationary time series $\{X_t\}_{t=0}^T$ (see Assumption 1 in Appendix B for the definition).
However, functional causal models of the above general form cannot be directly used for causal discovery due to structural unidentifiability (Lemma 1, Zhang et al. (2015)). One way to resolve this is to sacrifice flexibility by restricting the functional class, for example to additive noise models (ANMs) (Hoyer et al., 2008):
$$X_i = f_i(\mathrm{Pa}_G(X_i)) + \epsilon_i, \quad (3)$$
which have recently been used for causal reasoning with non-temporal data (Geffner et al., 2022).
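As a minimal illustration of the ANM assumption in Eq. (3), the sketch below generates bivariate data from an arbitrary tanh mechanism (an illustrative choice, not a mechanism from the paper) and checks the defining property: the residual in the causal direction recovers noise that is independent of the cause.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy additive noise model (Eq. 3) on two variables: X1 -> X2.
# The mechanism f and the noise scale are illustrative assumptions.
n = 5000
eps1 = rng.normal(size=n)
eps2 = rng.normal(size=n)
x1 = eps1                              # root node: X1 = eps1
x2 = np.tanh(2.0 * x1) + 0.3 * eps2    # X2 = f(Pa(X2)) + eps2

# In an ANM, subtracting the mechanism in the causal direction
# recovers the additive noise exactly, and that noise is independent
# of the cause X1.
residual = x2 - np.tanh(2.0 * x1)
corr = np.corrcoef(residual, x1)[0, 1]  # near zero by independence
```

This residual-independence property is exactly what identifiability arguments for ANMs exploit: fitting the anti-causal direction generally leaves a residual that is *not* independent of the input.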
Granger Causality   Granger causality (Granger, 1969) has been extensively used for temporal causal discovery. It is based on the idea that the series $X^j$ does not Granger cause $X^i$ if the history $X^j_{<t}$ does not help the prediction of $X^i_t$ for some $t$, given the past of all other time series $X^k$ for $k \neq j, i$.

Definition 2.1 (Granger Causality (Tank et al., 2018; Löwe et al., 2022)). Given a multivariate stationary time series $\{X_t\}_{t=0}^T$ and an SEM $f_{i,t}$ defined as
$$X_t^i = f_{i,t}(\mathrm{Pa}_G^i(<t)) + \epsilon_t^i, \quad (4)$$
$X^j$ Granger causes $X^i$ if $\exists\, l \in [1, t]$ such that $X_{t-l}^j \in \mathrm{Pa}_G^i(<t)$ and $f_{i,t}$ depends on $X_{t-l}^j$.

Granger causality is equivalent to causal relations for a directed acyclic graph (DAG) if there are no latent confounders and no instantaneous effects (Peters et al., 2013; 2017). Apart from the lack of instantaneous effects, it also ignores history-dependent noise by assuming independent $\epsilon_t^i$.
Vector Auto-regressive Model   Another line of research focuses on directly fitting an identifiable SEM to the observational data with instantaneous effects. One commonly-used approach is vector auto-regression (Hyvärinen et al., 2010; Pamfil et al., 2020):
$$X_t^i = \beta_i + \sum_{\tau=0}^{K} \sum_{j=1}^{D} B_{\tau,ji}\, X_{t-\tau}^j + \epsilon_t^i \quad (5)$$
where $\beta_i$ is the offset, $K$ is the model lag, $B_\tau \in \mathbb{R}^{D \times D}$ is the weighted adjacency matrix specifying the connections at time $t-\tau$ (i.e. $B_{\tau,ji} = 0$ means no connection from $X_{t-\tau}^j$ to $X_t^i$), and $\epsilon_t^i$ is the independent noise. Under these assumptions, the above linear SEM is structurally identifiable, which is a necessary condition for recovering the ground-truth graph (Hyvärinen et al., 2010; Peters et al., 2013; Pamfil et al., 2020). However, the above linear SEM with independent noise variables is too restrictive to fulfil the requirements described in Section 1. Therefore, the research question is how to design a structurally identifiable non-linear SEM with flexible history-dependent noise.
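A linear SEM of the form of Eq. (5) is straightforward to simulate. The sketch below uses arbitrary illustrative weight matrices; the instantaneous part is handled by solving the linear system X_t = (I − B0ᵀ)⁻¹(β + lagged + ε_t), which requires the instantaneous matrix B0 to be acyclic:

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, T = 3, 1, 2000

# Illustrative weighted adjacency matrices (assumptions for this sketch).
# B[tau][j, i] is the weight of the edge X^j_{t-tau} -> X^i_t.
# B[0] holds instantaneous effects and must be acyclic.
B = np.zeros((K + 1, D, D))
B[0][0, 1] = 0.8          # X^0_t -> X^1_t (instantaneous)
B[1][1, 2] = 0.5          # X^1_{t-1} -> X^2_t (lagged)
beta = np.zeros(D)

X = np.zeros((T, D))
for t in range(K, T):
    eps = 0.1 * rng.normal(size=D)
    lagged = sum(B[tau].T @ X[t - tau] for tau in range(1, K + 1))
    # Instantaneous effects appear on both sides of Eq. (5):
    # X_t = beta + B0^T X_t + lagged + eps, so solve the linear system.
    X[t] = np.linalg.solve(np.eye(D) - B[0].T, beta + lagged + eps)
```

Regressing `X[t, 2]` on `X[t-1, 1]` over the simulated data recovers the lagged coefficient 0.5, which is the sense in which this linear SEM is identifiable from observations.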
3 RHINO: RELATIONSHIP LEARNING WITH HISTORY DEPENDENT NOISE
This section introduces the Rhino model: Section 3.1 describes specific choices in the form of
Rhino's SEM, allowing for history-dependent noise. Section 3.2 details how variational inference
can be leveraged to perform causal discovery with the proposed functional form of the SEM.
3.1 MODEL FORMULATION
For a multivariate stationary time series $\{X_t\}_{t=0}^T$, we assume that the causal relations follow a temporal adjacency matrix $G_{0:K}$ with maximum lag $K$, where $G_{\tau \in [1,K]}$ specifies the lagged effects between $X_{t-\tau}$ and $X_t$, and $G_0$ specifies the instantaneous parents. We define $G_{\tau,ij} = 1$ if $X_{t-\tau}^i \to X_t^j$ and $0$ otherwise.¹ We propose a novel functional causal model that incorporates non-linear relations, instantaneous effects, and flexible history-dependent noise, called Rhino:
$$X_t^i = f_i(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)) + g_i(\mathrm{Pa}_G^i(<t), \epsilon_t^i) \quad (6)$$
where $f_i$ is a general differentiable non-linear function, and $g_i$ is a differentiable transform such that the transformed noise has a proper density. Although Rhino has an additive structure, our formulation offers much more flexibility in both functional relations and noise distributions compared to previous works (Pamfil et al., 2020; Peters et al., 2013). By placing few restrictions on $f_i, g_i$, Rhino can capture functional non-linearity through $f_i$ and transform $\epsilon_t^i$ through a flexible function $g_i$, depending on $\mathrm{Pa}_G^i(<t)$, to capture the history dependency of the additive noise.

Next, we propose flexible functional designs for $f_i, g_i$, which must respect the relations encapsulated in $G$. Namely, if $X_{t-\tau}^j \notin \mathrm{Pa}_G^i(<t) \cup \mathrm{Pa}_G^i(t)$, then $\partial f_i / \partial X_{t-\tau}^j = 0$, and similarly for $g_i$. We design
$$f_i(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)) = \zeta_i\left(\sum_{\tau=0}^{K} \sum_{j=1}^{D} G_{\tau,ji}\, \ell_{\tau j}\!\left(X_{t-\tau}^j\right)\right) \quad (7)$$

¹In the following, we use the notations $G$ and $G_{0:K}$ interchangeably for brevity.
where $\zeta_i$ and $\ell_{\tau j}$ ($j \in [1, D]$ and $\tau \in [0, K]$) are neural networks. For efficient computation, we use weight sharing across nodes and lags: $\zeta_i(\cdot) = \zeta(\cdot, \mathbf{u}_{0,i})$ and $\ell_{\tau j}(\cdot) = \ell(\cdot, \mathbf{u}_{\tau,j})$, where $\mathbf{u}_{\tau,i}$ is the trainable embedding for node $i$ at time $t-\tau$.
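The graph-masked design of Eq. (7) can be sketched as follows. Tiny one-hidden-layer networks stand in for the paper's MLPs, and conditioning on the embeddings is implemented by simple concatenation (an assumption about the architecture, made for brevity); the point of the sketch is the masking, i.e. that f_i only depends on inputs selected by G:

```python
import numpy as np

rng = np.random.default_rng(3)
D, K, H, E = 3, 2, 16, 4  # nodes, max lag, hidden width, embedding size

# Trainable node-and-lag embeddings u[tau, i] used for weight sharing.
U = rng.normal(size=(K + 1, D, E))

# Shared networks: ell maps (x, embedding) -> R^H, zeta maps (sum, embedding) -> R.
# One-hidden-layer MLPs with random weights stand in for trained networks.
W1_l = 0.1 * rng.normal(size=(H, 1 + E))
W2_l = 0.1 * rng.normal(size=(H, H))
W1_z = 0.1 * rng.normal(size=(H, H + E))
W2_z = 0.1 * rng.normal(size=(1, H))

def ell(x, u):
    return W2_l @ np.tanh(W1_l @ np.concatenate(([x], u)))

def zeta(s, u):
    return float(W2_z @ np.tanh(W1_z @ np.concatenate((s, u))))

def f_i(i, G, X_window):
    """Eq. (7): f_i = zeta_i( sum_{tau, j} G[tau, j, i] * ell_{tau j}(X^j_{t-tau}) ).
    X_window[tau] holds the slice X_{t-tau}, tau = 0..K."""
    s = np.zeros(H)
    for tau in range(K + 1):
        for j in range(D):
            if G[tau, j, i]:
                s += ell(X_window[tau, j], U[tau, j])
    return zeta(s, U[0, i])

G = np.zeros((K + 1, D, D), dtype=int)
G[0, 0, 1] = 1  # instantaneous edge X^0_t -> X^1_t
G[1, 2, 1] = 1  # lagged edge X^2_{t-1} -> X^1_t
X_window = rng.normal(size=(K + 1, D))
value = f_i(1, G, X_window)
```

Because the sum in Eq. (7) is gated by G, perturbing a non-parent input leaves f_i unchanged while perturbing a parent does not, which is exactly the sparsity constraint stated above.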
The design of $g_i$ needs to balance the flexibility and tractability of the transformed noise density for the sake of training. We thus choose a conditional normalizing flow, specifically a conditional spline flow (Trippe & Turner, 2018; Durkan et al., 2019; Pawlowski et al., 2020), with a fixed Gaussian noise $\epsilon_t^i$ for all $t$ and $i$. The spline bin parameters are predicted by a hyper-network with a form similar to Eq. (7) to incorporate history dependency; the only difference is that $\tau$ is now summed over $[1, K]$ to exclude the instantaneous parents. Due to the invertibility of $g_i$, the noise likelihood conditioned on the lagged parents is
$$p_{g_i}\!\left(g_i(\epsilon_t^i) \,\middle|\, \mathrm{Pa}_G^i(<t)\right) = p(\epsilon_t^i)\left|\frac{\partial g_i^{-1}}{\partial \epsilon_t^i}\right|. \quad (8)$$
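The conditional density in Eq. (8) is a standard change of variables. For brevity, the sketch below replaces the conditional spline with a conditional affine transform (an illustrative stand-in, not the paper's flow), whose shift and scale are arbitrary functions of the lagged parents:

```python
import numpy as np

# Change-of-variables density of Eq. (8), sketched with a conditional
# affine transform: z = g(eps; pa) = mu(pa) + sigma(pa) * eps, eps ~ N(0, 1).
# mu and sigma are illustrative history-dependent functions (assumptions).

def mu(pa):
    return 0.2 * np.sum(pa)             # history-dependent shift

def sigma(pa):
    return 0.1 + np.log1p(np.exp(np.sum(pa)))  # softplus keeps scale positive

def log_p_z_given_history(z, pa):
    """log p_g(z | Pa(<t)) = log p(eps) + log |d g^{-1} / d z|."""
    eps = (z - mu(pa)) / sigma(pa)      # invert the transform: g^{-1}(z; pa)
    log_p_eps = -0.5 * (eps ** 2 + np.log(2 * np.pi))
    log_jac = -np.log(sigma(pa))        # d eps / d z = 1 / sigma(pa)
    return log_p_eps + log_jac

pa = np.array([0.5, -1.0])              # toy lagged-parent values
logp = log_p_z_given_history(0.3, pa)
```

The resulting conditional density is properly normalised for every history, which is what makes the transformed-noise likelihood usable as a training objective; a spline flow follows the same recipe with a richer, piecewise-monotone g.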
3.2 VARIATIONAL INFERENCE FOR RHINO
Rhino adopts a Bayesian view of causal discovery (Heckerman et al., 2006), which aims to learn a posterior distribution over graphs instead of inferring a single graph. For $N$ observed multivariate time series $X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)}$, the joint likelihood of Rhino is
$$p\!\left(X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)}, G\right) = p(G) \prod_{n=1}^{N} p_\theta\!\left(X_{0:T}^{(n)} \mid G\right) \quad (9)$$
where $\theta$ are the model parameters. Once fitted, the posterior $p(G \mid X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)})$ encodes the belief about the underlying causal relationships.
Graph Prior   When designing the graph prior, we combine three components: (1) a DAG constraint; (2) a graph sparseness prior; (3) optional domain-specific prior knowledge. Inspired by NOTEARS (Zheng et al., 2018; Geffner et al., 2022; Morales-Alvarez et al., 2021), we propose the following unnormalised prior
$$p(G) \propto \exp\!\left(-\lambda_s \|G_{0:K}\|_F^2 - \rho\, h^2(G_0) - \alpha\, h(G_0) - \lambda_p \|G_{0:K} - G_{0:K}^p\|_F^2\right) \quad (10)$$
where $h(G) = \mathrm{tr}(e^{G \odot G}) - D$ is the DAG penalty proposed in Zheng et al. (2018), which is $0$ if and only if $G$ is a DAG; $\odot$ is the Hadamard product; $G^p$ is an optional domain-specific prior graph, which can be used when partial domain knowledge is available; $\lambda_s, \lambda_p$ specify the strengths of the graph sparseness and domain-specific prior terms, respectively; and $\alpha, \rho$ characterise the strength of the DAG penalty. Since the lagged connections specified in $G_{1:K}$ can only follow the direction of time, only the instantaneous part, $G_0$, can contain cycles. Thus, the DAG penalty is applied only to $G_0$.
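The DAG penalty h(G) = tr(e^{G∘G}) − D from Eq. (10) can be computed directly; a minimal sketch on a toy instantaneous graph:

```python
import numpy as np
from scipy.linalg import expm

def dag_penalty(G0):
    """NOTEARS penalty h(G) = tr(exp(G o G)) - D (Zheng et al., 2018).
    The trace of the matrix exponential counts weighted closed walks of
    every length, so it exceeds D exactly when a directed cycle exists."""
    D = G0.shape[0]
    return np.trace(expm(G0 * G0)) - D

# Acyclic instantaneous graph: 0 -> 1 -> 2.
G_dag = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [0., 0., 0.]])

# Adding the back-edge 2 -> 0 creates a 3-cycle.
G_cyclic = G_dag.copy()
G_cyclic[2, 0] = 1.
```

During training, the augmented Lagrangian terms ρh² + αh in Eq. (10) push this quantity toward zero, so the learned instantaneous graph is driven into the acyclic region.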
Variational Objective   Unfortunately, the exact graph posterior $p(G \mid X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)})$ is intractable due to the large combinatorial space of DAGs. To overcome this challenge, we adopt variational inference (Blei et al., 2017; Zhang et al., 2018), which uses a variational distribution $q_\phi(G)$ to approximate the true posterior. We choose $q_\phi(G)$ to be a product of independent Bernoulli distributions (refer to Appendix E for details). The corresponding evidence lower bound (ELBO) is
$$\log p_\theta\!\left(X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)}\right) \geq \underbrace{\mathbb{E}_{q_\phi(G)}\!\left[\sum_{n=1}^{N} \log p_\theta\!\left(X_{0:T}^{(n)} \mid G\right) + \log p(G)\right] + H(q_\phi(G))}_{\mathrm{ELBO}(\theta, \phi)} \quad (11)$$
where $H(q_\phi(G))$ is the entropy of $q_\phi(G)$. From the causal Markov assumption and the auto-regressive nature of the model, we can further simplify
$$\log p_\theta\!\left(X_{0:T}^{(n)} \mid G\right) = \sum_{t=0}^{T} \sum_{i=1}^{D} \log p_\theta\!\left(X_t^{i,(n)} \mid \mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)\right) \quad (12)$$
and from Rhino's functional form (Eq. (6)) proposed in Section 3.1,
$$\log p_\theta\!\left(X_t^{i,(n)} \mid \mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)\right) = \log p_{g_i}\!\left(z_t^{i,(n)} \mid \mathrm{Pa}_G^i(<t)\right) \quad (13)$$
where $z_t^{i,(n)} = X_t^{i,(n)} - f_i(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t))$ and $p_{g_i}$ is defined in Eq. (8) (see Appendix A for details).
The parameters $\theta, \phi$ are learned by maximising the ELBO, where the Gumbel-softmax gradient estimator is used for $\phi$ (Jang et al., 2016; Maddison et al., 2016). We also leverage augmented Lagrangian training (Hestenes, 1969; Andreani et al., 2008), similarly to Geffner et al. (2022), to anneal $\alpha, \rho$ in the prior and ensure that Rhino only produces DAGs (refer to Appendix B.1 in Geffner et al. (2022)). Once Rhino is fitted, the temporal causal graph can be inferred by sampling $G \sim q_\phi(G)$.
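The Gumbel-softmax estimator mentioned above can be sketched for a single edge of q_phi(G). The logit and temperature below are illustrative values, and this is a generic relaxed-Bernoulli sample rather than Rhino's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def gumbel_softmax_edge(logit, temperature=0.5):
    """Relaxed Bernoulli sample for one edge of q_phi(G).
    Returns a value in (0, 1) that concentrates on {0, 1} as the
    temperature -> 0, while staying differentiable w.r.t. the logit
    (which is what allows gradient-based learning of phi)."""
    u = rng.uniform(1e-10, 1 - 1e-10, size=2)
    g = -np.log(-np.log(u))               # two Gumbel(0, 1) samples
    y = (np.array([logit, 0.0]) + g) / temperature  # [edge on, edge off]
    y = np.exp(y - y.max())               # numerically stable softmax
    return y[0] / y.sum()                 # soft "edge present" weight

# With a strongly positive logit, relaxed samples concentrate near 1,
# i.e. the edge is almost surely present.
samples = np.array([gumbel_softmax_edge(3.0) for _ in range(2000)])
```

At sampling time after training, these relaxed values are replaced by hard Bernoulli draws, yielding discrete adjacency matrices G ~ q_phi(G).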
Treatment effect estimation As Rhino learns the causal graph and the functional relationship
simultaneously, our model can be extended to causal inference tasks such as treatment effect estimation (Geffner et al., 2022). See Appendix D for details.
4 THEORETICAL CONSIDERATIONS
In this section, we focus on the theoretical guarantees of Rhino including (1) structural identifiability
and (2) the soundness of the proposed variational objective. Together, they guarantee the validity of
Rhino as a causal discovery method. Finally, we clarify its relation to existing works.
4.1 STRUCTURAL IDENTIFIABILITY
One of the key challenges for causal discovery with a flexible functional relationship is to show the
structural identifiability. Namely, we cannot find two different graphs that induce the same joint
likelihood from the proposed functional causal model. In the following, we present a theorem for
Rhino that summarizes our main theoretical contribution.
Theorem 1 (Identifiability of Rhino). Assume Rhino satisfies the causal Markov property, causal minimality and causal sufficiency, and that the induced joint likelihood has a proper density (see Appendix B for details). Further assume that (1) all functions and induced distributions of Rhino are third-order differentiable; (2) the function $f_i$ is non-linear and not invertible w.r.t. any node in $\mathrm{Pa}_G^i(t)$; (3) the second derivative $\left(\log p_{g_i}(g_i(\epsilon_t^i) \mid \mathrm{Pa}_G^i(<t))\right)''$ w.r.t. $\epsilon_t^i$ is zero at most at some discrete points. Then Rhino, as defined in Eq. (6), is structurally identifiable for both bivariate and multivariate time series.
Sketch of proof. This theorem summarises a collection of theorems proved in Appendix B. The strategy is, instead of directly proving the identifiability of Rhino, to provide identifiability conditions for general temporal SEMs and then show that a generalization of Rhino satisfies these conditions; the identifiability of Rhino follows directly.
Bivariate identifiability conditions for general temporal SEMs   The first step is to prove the bivariate identifiability conditions that a general temporal SEM (Eq. (2)) should satisfy (refer to Theorem 3 in Appendix B.1). In a nutshell, we prove that the functional causal model is bivariate identifiable if (1) the model for the initial conditions is identifiable, and (2) the model is identifiable w.r.t. the instantaneous parents. Remarkably, condition (2) implies that we only need to pay attention to the instantaneous parents for identifiability, which opens the door to flexible lagged-parent dependency. This theorem assumes the causal Markov, minimality, sufficiency and proper-density assumptions.
Identifiability of the history-dependent post non-linear model   Next, we propose a novel generalization of Rhino, called history-dependent PNL. Theorem 4 and Corollary 4.1 in Appendix B.2 prove that it is bivariate identifiable w.r.t. the instantaneous parents (i.e. it satisfies the conditions of Theorem 3) under the additional assumptions (1), (2) and (3) of Theorem 1. The history-dependent PNL is defined as
$$X_t^i = \nu_{it}\!\left(f_{it}\!\left(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)\right) + g_{it}\!\left(\mathrm{Pa}_G^i(<t), \epsilon_{it}\right),\; \mathrm{Pa}_G^i(<t)\right),$$
where $\nu$ is invertible w.r.t. its first argument. The bivariate identifiability of Rhino follows directly, since Rhino is the special case where $\nu$ is the identity mapping.
Generalization to the multivariate case   Finally, inspired by Peters et al. (2012), we prove that the above bivariate identifiability can be generalized to the multivariate case. Refer to Theorem 5 in Appendix B.3 for details.