RHINO: DEEP CAUSAL TEMPORAL RELATIONSHIP
LEARNING WITH HISTORY-DEPENDENT NOISE
Wenbo Gong, Joel Jennings, Cheng Zhang & Nick Pawlowski
Microsoft Research
Cambridge, UK
{t-gongwenbo, joeljennings, cheng.zhang, nick.pawlowski}@microsoft.com
ABSTRACT
Discovering causal relationships between different variables from time series data has been a long-standing challenge for many domains such as climate science, finance and healthcare. Given the complexity of real-world relationships and the nature of observations in discrete time, causal discovery methods need to consider non-linear relations between variables, instantaneous effects and history-dependent noise (the change of noise distribution due to past actions). However, previous works do not offer a solution addressing all these problems together. In this paper, we propose a novel causal relationship learning framework for time-series data, called Rhino, which combines vector auto-regression, deep learning and variational inference to model non-linear relationships with instantaneous effects while allowing the noise distribution to be modulated by historical observations. Theoretically, we prove the structural identifiability of Rhino. Our empirical results from extensive synthetic experiments and two real-world benchmarks demonstrate better discovery performance compared to relevant baselines, with ablation studies revealing its robustness under model misspecification.
1 INTRODUCTION
Time series data is a collection of data points recorded at different timestamps describing a pattern
of chronological change. Identifying the causal relations between different variables and their in-
teractions through time (Spirtes et al., 2000; Berzuini et al., 2012; Guo et al., 2020; Peters et al.,
2017) is essential for many applications, e.g. climate science and health care. Randomized controlled trials are the gold standard for discovering such relationships, but may be unavailable due to cost
and ethical constraints. Therefore, causal discovery with just observational data is important and
fundamental to many real-world applications (Löwe et al., 2022; Bussmann et al., 2021; Moraffah et al., 2021; Wu et al., 2020; Runge, 2018; Tank et al., 2018; Hyvärinen et al., 2010; Pamfil et al., 2020).
The task of temporal causal discovery is challenging for several reasons: (1) relations between variables can be non-linear in the real world; (2) with a slow sampling interval, everything that happens in between is aggregated into the same timestamp, i.e. instantaneous effects; (3) the noise may be non-stationary (its distribution depends on past observations), i.e. history-dependent noise. For example, in stock markets, the announcement of a decision by a leading company after the market closes may have complex effects (non-linearity) on its stock price immediately after the market opens (slow sampling interval and instantaneous effect), and its price volatility may also change (history-dependent noise). Similarly, in education, students who recently earned good marks on algebra tests should also score well on an upcoming algebra exam with little variation (history-dependent noise).
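To make the notion of history-dependent noise concrete, the sketch below simulates a toy AR(1) process whose noise scale grows with the magnitude of the previous observation; all coefficients here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T=10000, history_dependent=True):
    # AR(1) process: x_t = 0.5 * x_{t-1} + sigma_t * eps_t.
    # With history-dependent noise, the noise scale sigma_t grows
    # with the magnitude of the previous observation (illustrative choice).
    x = np.zeros(T)
    for t in range(1, T):
        sigma = 0.1 + 0.5 * abs(x[t - 1]) if history_dependent else 0.3
        x[t] = 0.5 * x[t - 1] + sigma * rng.normal()
    return x

x_hist = simulate(history_dependent=True)
x_fix = simulate(history_dependent=False)
# Under history-dependent noise, the conditional variance of x_t varies
# with the past: the residual spread after a large |x_{t-1}| exceeds the
# spread after a small |x_{t-1}|.
```

Conditioning the residuals `x[t] - 0.5 * x[t-1]` on the magnitude of the previous value exposes the heteroscedasticity that a fixed-noise model cannot capture.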
To the best of our knowledge, the performance of existing frameworks suffers in many real-world scenarios because they cannot address these aspects in a satisfactory way. In particular, history-dependent noise has rarely been considered in the past. A large category of preceding works, called Granger causality (Granger, 1969), is based on the fact that cause-effect relationships can never go against time. Despite many recent advances (Wu et al., 2020; Shojaie & Michailidis, 2010; Siggiridou & Kugiumtzis, 2015; Amornbunchornvej et al., 2019; Löwe et al., 2022; Tank et al., 2018; Bussmann et al., 2021; Dang et al., 2018; Xu et al., 2019), they all rely on the absence of instantaneous effects and assume a fixed noise distribution. Constraint-based methods have also been extended for time-series causal discovery (Runge, 2018; 2020), commonly by folding the time series. This introduces new assumptions and translates the aforementioned requirements into challenges in conditional independence testing (Shah & Peters, 2020). Additionally, they require a stronger faithfulness assumption and can only identify the causal graph up to a Markov equivalence class without detailed functional relationships.

arXiv:2210.14706v1 [cs.LG] 26 Oct 2022
An alternative line of research leverages causal discovery with functional causal models (Hyvärinen et al., 2010; Pamfil et al., 2020; Peters et al., 2013). These can model both instantaneous and lagged effects as long as they have theoretically guaranteed structural identifiability. Unfortunately, they do not consider history-dependent noise. One central challenge in modelling this dependency is that noise depending on the lagged parents may break the model's structural identifiability. For static data, Khemakhem et al. (2021) prove structural identifiability only when this dependency is restricted to a simple functional form. Thus, the key research question is whether identifiability can be preserved with complex historical dependencies in the temporal setting.
Motivated by these requirements, we propose a novel temporal causal discovery method called Rhino (deep causal temporal relationship learning with history-dependent noise), which can model non-linear lagged and instantaneous effects with flexible history-dependent noise. Our contributions are:
• A novel causal discovery framework called Rhino, which combines vector auto-regression and deep learning to model non-linear lagged and instantaneous effects with history-dependent noise. We also propose a principled training framework using variational inference.
• We prove that Rhino is structurally identifiable. To achieve this, we provide general conditions for structural identifiability with history-dependent noise, of which Rhino is a special case. Furthermore, we clarify relations to several previous works.
• We conduct extensive synthetic experiments with ablation studies to demonstrate the advantages of Rhino and its robustness under model misspecification. Additionally, we compare its performance to a wide range of baselines on two real-world discovery benchmarks.
2 BACKGROUND
In this section, we briefly introduce necessary preliminaries for Rhino. In particular, we focus on
structural equation models, Granger causality (Granger, 1969) and vector auto-regression.
Structural Equation Models (SEMs)   Consider $X \in \mathbb{R}^D$ with $D$ variables; an SEM describes the causal relationships between them given a causal graph $G$:
$$X_i = f_i(\mathrm{Pa}_G^i, \epsilon_i) \quad (1)$$
where $\mathrm{Pa}_G^i$ are the parents of node $i$ and the $\epsilon_i$ are mutually independent noise variables. In the context of a multivariate time series, $X_t = \{X_t^i\}_{i \in V}$ where $V$ is a set of nodes with size $D$, the corresponding SEM given a temporal causal graph $G$ is
$$X_t^i = f_{i,t}(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t), \epsilon_t^i), \quad (2)$$
where $\mathrm{Pa}_G^i(<t)$ contains the parent values specified by $G$ at previous time steps (lagged parents) and $\mathrm{Pa}_G^i(t)$ are the parents at the current time $t$ (instantaneous parents). The above SEM induces a joint distribution over the stationary time series $\{X_t\}_{t=0}^T$ (see Assumption 1 in Appendix B for the definition).
However, functional causal models of the above general form cannot be directly used for causal discovery due to structural unidentifiability (Lemma 1, Zhang et al. (2015)). One way to resolve this is to sacrifice flexibility by restricting the functional class, for example to additive noise models (ANMs) (Hoyer et al., 2008):
$$X_i = f_i(\mathrm{Pa}_G(X_i)) + \epsilon_i, \quad (3)$$
which have recently been used for causal reasoning with non-temporal data (Geffner et al., 2022).
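As a minimal illustration of the ANM assumption in Eq. (3), the sketch below generates bivariate data from an arbitrary tanh mechanism (an illustrative choice, not a mechanism from the paper) and checks the defining property: the residual in the causal direction recovers noise that is independent of the cause.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy additive noise model (Eq. 3) on two variables: X1 -> X2.
# The mechanism f and the noise scale are illustrative assumptions.
n = 5000
eps1 = rng.normal(size=n)
eps2 = rng.normal(size=n)
x1 = eps1                              # root node: X1 = eps1
x2 = np.tanh(2.0 * x1) + 0.3 * eps2    # X2 = f(Pa(X2)) + eps2

# In an ANM, subtracting the mechanism in the causal direction
# recovers the additive noise exactly, and that noise is independent
# of the cause X1.
residual = x2 - np.tanh(2.0 * x1)
corr = np.corrcoef(residual, x1)[0, 1]  # near zero by independence
```

This residual-independence property is exactly what identifiability arguments for ANMs exploit: fitting the anti-causal direction generally leaves a residual that is *not* independent of the input.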
Granger Causality   Granger causality (Granger, 1969) has been extensively used for temporal causal discovery. It is based on the idea that the series $X^j$ does not Granger cause $X^i$ if the history $X^j_{<t}$ does not help the prediction of $X^i_t$ for some $t$, given the past of all other time series $X^k$ for $k \neq j, i$.

Definition 2.1 (Granger Causality (Tank et al., 2018; Löwe et al., 2022)). Given a multivariate stationary time series $\{X_t\}_{t=0}^T$ and an SEM $f_{i,t}$ defined as
$$X_t^i = f_{i,t}(\mathrm{Pa}_G^i(<t)) + \epsilon_t^i, \quad (4)$$
$X^j$ Granger causes $X^i$ if $\exists\, l \in [1, t]$ such that $X_{t-l}^j \in \mathrm{Pa}_G^i(<t)$ and $f_{i,t}$ depends on $X_{t-l}^j$.

Granger causality is equivalent to causal relations for a directed acyclic graph (DAG) if there are no latent confounders and no instantaneous effects (Peters et al., 2013; 2017). Apart from the lack of instantaneous effects, it also ignores history-dependent noise by assuming independent $\epsilon_t^i$.
Vector Auto-regressive Model   Another line of research focuses on directly fitting an identifiable SEM to the observational data with instantaneous effects. One commonly-used approach is vector auto-regression (Hyvärinen et al., 2010; Pamfil et al., 2020):
$$X_t^i = \beta_i + \sum_{\tau=0}^{K} \sum_{j=1}^{D} B_{\tau,ji}\, X_{t-\tau}^j + \epsilon_t^i \quad (5)$$
where $\beta_i$ is the offset, $K$ is the model lag, $B_\tau \in \mathbb{R}^{D \times D}$ is the weighted adjacency matrix specifying the connections at time $t-\tau$ (i.e. $B_{\tau,ji} = 0$ means no connection from $X_{t-\tau}^j$ to $X_t^i$), and $\epsilon_t^i$ is the independent noise. Under these assumptions, the above linear SEM is structurally identifiable, which is a necessary condition for recovering the ground-truth graph (Hyvärinen et al., 2010; Peters et al., 2013; Pamfil et al., 2020). However, the above linear SEM with independent noise variables is too restrictive to fulfil the requirements described in Section 1. Therefore, the research question is how to design a structurally identifiable non-linear SEM with flexible history-dependent noise.
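A linear SEM of the form of Eq. (5) is straightforward to simulate. The sketch below uses arbitrary illustrative weight matrices; the instantaneous part is handled by solving the linear system X_t = (I − B0ᵀ)⁻¹(β + lagged + ε_t), which requires the instantaneous matrix B0 to be acyclic:

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, T = 3, 1, 2000

# Illustrative weighted adjacency matrices (assumptions for this sketch).
# B[tau][j, i] is the weight of the edge X^j_{t-tau} -> X^i_t.
# B[0] holds instantaneous effects and must be acyclic.
B = np.zeros((K + 1, D, D))
B[0][0, 1] = 0.8          # X^0_t -> X^1_t (instantaneous)
B[1][1, 2] = 0.5          # X^1_{t-1} -> X^2_t (lagged)
beta = np.zeros(D)

X = np.zeros((T, D))
for t in range(K, T):
    eps = 0.1 * rng.normal(size=D)
    lagged = sum(B[tau].T @ X[t - tau] for tau in range(1, K + 1))
    # Instantaneous effects appear on both sides of Eq. (5):
    # X_t = beta + B0^T X_t + lagged + eps, so solve the linear system.
    X[t] = np.linalg.solve(np.eye(D) - B[0].T, beta + lagged + eps)
```

Regressing `X[t, 2]` on `X[t-1, 1]` over the simulated data recovers the lagged coefficient 0.5, which is the sense in which this linear SEM is identifiable from observations.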
3 RHINO: RELATIONSHIP LEARNING WITH HISTORY DEPENDENT NOISE
This section introduces the Rhino model: Section 3.1 describes specific choices in the form of
Rhino's SEM, allowing for history-dependent noise. Section 3.2 details how variational inference
can be leveraged to perform causal discovery with the proposed functional form of the SEM.
3.1 MODEL FORMULATION
For a multivariate stationary time series $\{X_t\}_{t=0}^T$, we assume that the causal relations follow a temporal adjacency matrix $G_{0:K}$ with maximum lag $K$, where $G_{\tau \in [1,K]}$ specifies the lagged effects between $X_{t-\tau}$ and $X_t$, and $G_0$ specifies the instantaneous parents. We define $G_{\tau,ij} = 1$ if $X_{t-\tau}^i \to X_t^j$ and $0$ otherwise.¹ We propose a novel functional causal model that incorporates non-linear relations, instantaneous effects, and flexible history-dependent noise, called Rhino:
$$X_t^i = f_i(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)) + g_i(\mathrm{Pa}_G^i(<t), \epsilon_t^i) \quad (6)$$
where $f_i$ is a general differentiable non-linear function, and $g_i$ is a differentiable transform such that the transformed noise has a proper density. Although Rhino has an additive structure, our formulation offers much more flexibility in both functional relations and noise distributions compared to previous works (Pamfil et al., 2020; Peters et al., 2013). By placing few restrictions on $f_i, g_i$, Rhino can capture functional non-linearity through $f_i$ and transform $\epsilon_t^i$ through a flexible function $g_i$, depending on $\mathrm{Pa}_G^i(<t)$, to capture the history dependency of the additive noise.

Next, we propose flexible functional designs for $f_i, g_i$, which must respect the relations encapsulated in $G$. Namely, if $X_{t-\tau}^j \notin \mathrm{Pa}_G^i(<t) \cup \mathrm{Pa}_G^i(t)$, then $\partial f_i / \partial X_{t-\tau}^j = 0$, and similarly for $g_i$. We design
$$f_i(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)) = \zeta_i\left(\sum_{\tau=0}^{K} \sum_{j=1}^{D} G_{\tau,ji}\, \ell_{\tau j}\!\left(X_{t-\tau}^j\right)\right) \quad (7)$$

¹In the following, we use the notations $G$ and $G_{0:K}$ interchangeably for brevity.
where $\zeta_i$ and $\ell_{\tau j}$ ($j \in [1, D]$ and $\tau \in [0, K]$) are neural networks. For efficient computation, we use weight sharing across nodes and lags: $\zeta_i(\cdot) = \zeta(\cdot, \mathbf{u}_{0,i})$ and $\ell_{\tau j}(\cdot) = \ell(\cdot, \mathbf{u}_{\tau,j})$, where $\mathbf{u}_{\tau,i}$ is the trainable embedding for node $i$ at time $t-\tau$.
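The graph-masked design of Eq. (7) can be sketched as follows. Tiny one-hidden-layer networks stand in for the paper's MLPs, and conditioning on the embeddings is implemented by simple concatenation (an assumption about the architecture, made for brevity); the point of the sketch is the masking, i.e. that f_i only depends on inputs selected by G:

```python
import numpy as np

rng = np.random.default_rng(3)
D, K, H, E = 3, 2, 16, 4  # nodes, max lag, hidden width, embedding size

# Trainable node-and-lag embeddings u[tau, i] used for weight sharing.
U = rng.normal(size=(K + 1, D, E))

# Shared networks: ell maps (x, embedding) -> R^H, zeta maps (sum, embedding) -> R.
# One-hidden-layer MLPs with random weights stand in for trained networks.
W1_l = 0.1 * rng.normal(size=(H, 1 + E))
W2_l = 0.1 * rng.normal(size=(H, H))
W1_z = 0.1 * rng.normal(size=(H, H + E))
W2_z = 0.1 * rng.normal(size=(1, H))

def ell(x, u):
    return W2_l @ np.tanh(W1_l @ np.concatenate(([x], u)))

def zeta(s, u):
    return float(W2_z @ np.tanh(W1_z @ np.concatenate((s, u))))

def f_i(i, G, X_window):
    """Eq. (7): f_i = zeta_i( sum_{tau, j} G[tau, j, i] * ell_{tau j}(X^j_{t-tau}) ).
    X_window[tau] holds the slice X_{t-tau}, tau = 0..K."""
    s = np.zeros(H)
    for tau in range(K + 1):
        for j in range(D):
            if G[tau, j, i]:
                s += ell(X_window[tau, j], U[tau, j])
    return zeta(s, U[0, i])

G = np.zeros((K + 1, D, D), dtype=int)
G[0, 0, 1] = 1  # instantaneous edge X^0_t -> X^1_t
G[1, 2, 1] = 1  # lagged edge X^2_{t-1} -> X^1_t
X_window = rng.normal(size=(K + 1, D))
value = f_i(1, G, X_window)
```

Because the sum in Eq. (7) is gated by G, perturbing a non-parent input leaves f_i unchanged while perturbing a parent does not, which is exactly the sparsity constraint stated above.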
The design of $g_i$ needs to balance the flexibility and tractability of the transformed noise density for the sake of training. We thus choose a conditional normalizing flow, specifically a conditional spline flow (Trippe & Turner, 2018; Durkan et al., 2019; Pawlowski et al., 2020), with a fixed Gaussian noise $\epsilon_t^i$ for all $t$ and $i$. The spline bin parameters are predicted by a hyper-network with a form similar to Eq. (7) to incorporate history dependency; the only difference is that $\tau$ is now summed over $[1, K]$ to exclude the instantaneous parents. Due to the invertibility of $g_i$, the noise likelihood conditioned on the lagged parents is
$$p_{g_i}\!\left(g_i(\epsilon_t^i) \,\middle|\, \mathrm{Pa}_G^i(<t)\right) = p(\epsilon_t^i)\left|\frac{\partial g_i^{-1}}{\partial \epsilon_t^i}\right|. \quad (8)$$
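The conditional density in Eq. (8) is a standard change of variables. For brevity, the sketch below replaces the conditional spline with a conditional affine transform (an illustrative stand-in, not the paper's flow), whose shift and scale are arbitrary functions of the lagged parents:

```python
import numpy as np

# Change-of-variables density of Eq. (8), sketched with a conditional
# affine transform: z = g(eps; pa) = mu(pa) + sigma(pa) * eps, eps ~ N(0, 1).
# mu and sigma are illustrative history-dependent functions (assumptions).

def mu(pa):
    return 0.2 * np.sum(pa)             # history-dependent shift

def sigma(pa):
    return 0.1 + np.log1p(np.exp(np.sum(pa)))  # softplus keeps scale positive

def log_p_z_given_history(z, pa):
    """log p_g(z | Pa(<t)) = log p(eps) + log |d g^{-1} / d z|."""
    eps = (z - mu(pa)) / sigma(pa)      # invert the transform: g^{-1}(z; pa)
    log_p_eps = -0.5 * (eps ** 2 + np.log(2 * np.pi))
    log_jac = -np.log(sigma(pa))        # d eps / d z = 1 / sigma(pa)
    return log_p_eps + log_jac

pa = np.array([0.5, -1.0])              # toy lagged-parent values
logp = log_p_z_given_history(0.3, pa)
```

The resulting conditional density is properly normalised for every history, which is what makes the transformed-noise likelihood usable as a training objective; a spline flow follows the same recipe with a richer, piecewise-monotone g.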
3.2 VARIATIONAL INFERENCE FOR RHINO
Rhino adopts a Bayesian view of causal discovery (Heckerman et al., 2006), which aims to learn a posterior distribution over graphs instead of inferring a single graph. For $N$ observed multivariate time series $X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)}$, the joint likelihood of Rhino is
$$p\!\left(X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)}, G\right) = p(G) \prod_{n=1}^{N} p_\theta\!\left(X_{0:T}^{(n)} \mid G\right) \quad (9)$$
where $\theta$ are the model parameters. Once fitted, the posterior $p(G \mid X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)})$ encodes the belief about the underlying causal relationships.
Graph Prior   When designing the graph prior, we combine three components: (1) a DAG constraint; (2) a graph sparseness prior; (3) optional domain-specific prior knowledge. Inspired by NOTEARS (Zheng et al., 2018; Geffner et al., 2022; Morales-Alvarez et al., 2021), we propose the following unnormalised prior
$$p(G) \propto \exp\!\left(-\lambda_s \|G_{0:K}\|_F^2 - \rho\, h^2(G_0) - \alpha\, h(G_0) - \lambda_p \|G_{0:K} - G_{0:K}^p\|_F^2\right) \quad (10)$$
where $h(G) = \mathrm{tr}(e^{G \odot G}) - D$ is the DAG penalty proposed in Zheng et al. (2018), which is $0$ if and only if $G$ is a DAG; $\odot$ is the Hadamard product; $G^p$ is an optional domain-specific prior graph, which can be used when partial domain knowledge is available; $\lambda_s, \lambda_p$ specify the strengths of the graph sparseness and domain-specific prior terms, respectively; and $\alpha, \rho$ characterise the strength of the DAG penalty. Since the lagged connections specified in $G_{1:K}$ can only follow the direction of time, only the instantaneous part, $G_0$, can contain cycles. Thus, the DAG penalty is applied only to $G_0$.
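The DAG penalty h(G) = tr(e^{G∘G}) − D from Eq. (10) can be computed directly; a minimal sketch on a toy instantaneous graph:

```python
import numpy as np
from scipy.linalg import expm

def dag_penalty(G0):
    """NOTEARS penalty h(G) = tr(exp(G o G)) - D (Zheng et al., 2018).
    The trace of the matrix exponential counts weighted closed walks of
    every length, so it exceeds D exactly when a directed cycle exists."""
    D = G0.shape[0]
    return np.trace(expm(G0 * G0)) - D

# Acyclic instantaneous graph: 0 -> 1 -> 2.
G_dag = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [0., 0., 0.]])

# Adding the back-edge 2 -> 0 creates a 3-cycle.
G_cyclic = G_dag.copy()
G_cyclic[2, 0] = 1.
```

During training, the augmented Lagrangian terms ρh² + αh in Eq. (10) push this quantity toward zero, so the learned instantaneous graph is driven into the acyclic region.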
Variational Objective   Unfortunately, the exact graph posterior $p(G \mid X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)})$ is intractable due to the large combinatorial space of DAGs. To overcome this challenge, we adopt variational inference (Blei et al., 2017; Zhang et al., 2018), which uses a variational distribution $q_\phi(G)$ to approximate the true posterior. We choose $q_\phi(G)$ to be a product of independent Bernoulli distributions (refer to Appendix E for details). The corresponding evidence lower bound (ELBO) is
$$\log p_\theta\!\left(X_{0:T}^{(1)}, \ldots, X_{0:T}^{(N)}\right) \geq \underbrace{\mathbb{E}_{q_\phi(G)}\!\left[\sum_{n=1}^{N} \log p_\theta\!\left(X_{0:T}^{(n)} \mid G\right) + \log p(G)\right] + H(q_\phi(G))}_{\mathrm{ELBO}(\theta, \phi)} \quad (11)$$
where $H(q_\phi(G))$ is the entropy of $q_\phi(G)$. From the causal Markov assumption and the auto-regressive nature of the model, we can further simplify
$$\log p_\theta\!\left(X_{0:T}^{(n)} \mid G\right) = \sum_{t=0}^{T} \sum_{i=1}^{D} \log p_\theta\!\left(X_t^{i,(n)} \mid \mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)\right) \quad (12)$$
and from Rhino's functional form (Eq. (6)) proposed in Section 3.1,
$$\log p_\theta\!\left(X_t^{i,(n)} \mid \mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)\right) = \log p_{g_i}\!\left(z_t^{i,(n)} \mid \mathrm{Pa}_G^i(<t)\right) \quad (13)$$
where $z_t^{i,(n)} = X_t^{i,(n)} - f_i(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t))$ and $p_{g_i}$ is defined in Eq. (8) (see Appendix A for details).
The parameters $\theta, \phi$ are learned by maximising the ELBO, where the Gumbel-softmax gradient estimator is used for $\phi$ (Jang et al., 2016; Maddison et al., 2016). We also leverage augmented Lagrangian training (Hestenes, 1969; Andreani et al., 2008), similarly to Geffner et al. (2022), to anneal $\alpha, \rho$ in the prior and ensure that Rhino only produces DAGs (refer to Appendix B.1 in Geffner et al. (2022)). Once Rhino is fitted, the temporal causal graph can be inferred by sampling $G \sim q_\phi(G)$.
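The Gumbel-softmax estimator mentioned above can be sketched for a single edge of q_phi(G). The logit and temperature below are illustrative values, and this is a generic relaxed-Bernoulli sample rather than Rhino's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def gumbel_softmax_edge(logit, temperature=0.5):
    """Relaxed Bernoulli sample for one edge of q_phi(G).
    Returns a value in (0, 1) that concentrates on {0, 1} as the
    temperature -> 0, while staying differentiable w.r.t. the logit
    (which is what allows gradient-based learning of phi)."""
    u = rng.uniform(1e-10, 1 - 1e-10, size=2)
    g = -np.log(-np.log(u))               # two Gumbel(0, 1) samples
    y = (np.array([logit, 0.0]) + g) / temperature  # [edge on, edge off]
    y = np.exp(y - y.max())               # numerically stable softmax
    return y[0] / y.sum()                 # soft "edge present" weight

# With a strongly positive logit, relaxed samples concentrate near 1,
# i.e. the edge is almost surely present.
samples = np.array([gumbel_softmax_edge(3.0) for _ in range(2000)])
```

At sampling time after training, these relaxed values are replaced by hard Bernoulli draws, yielding discrete adjacency matrices G ~ q_phi(G).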
Treatment effect estimation As Rhino learns the causal graph and the functional relationship
simultaneously, our model can be extended to causal inference tasks such as treatment effect estimation (Geffner et al., 2022). See Appendix D for details.
4 THEORETICAL CONSIDERATIONS
In this section, we focus on the theoretical guarantees of Rhino including (1) structural identifiability
and (2) the soundness of the proposed variational objective. Together, they guarantee the validity of
Rhino as a causal discovery method. Finally, we clarify its relation to existing works.
4.1 STRUCTURAL IDENTIFIABILITY
One of the key challenges for causal discovery with a flexible functional relationship is to show the
structural identifiability. Namely, we cannot find two different graphs that induce the same joint
likelihood from the proposed functional causal model. In the following, we present a theorem for
Rhino that summarizes our main theoretical contribution.
Theorem 1 (Identifiability of Rhino). Assume Rhino satisfies the causal Markov property, causal minimality and causal sufficiency, and that the induced joint likelihood has a proper density (see Appendix B for details). Further assume that (1) all functions and induced distributions of Rhino are third-order differentiable; (2) the function $f_i$ is non-linear and not invertible w.r.t. any node in $\mathrm{Pa}_G^i(t)$; (3) the second derivative $\left(\log p_{g_i}(g_i(\epsilon_t^i) \mid \mathrm{Pa}_G^i(<t))\right)''$ w.r.t. $\epsilon_t^i$ is zero at most at some discrete points. Then Rhino, as defined in Eq. (6), is structurally identifiable for both bivariate and multivariate time series.
Sketch of proof. This theorem summarises a collection of theorems proved in Appendix B. The strategy is, instead of directly proving the identifiability of Rhino, to provide identifiability conditions for general temporal SEMs and then show that a generalization of Rhino satisfies these conditions; the identifiability of Rhino follows directly.
Bivariate identifiability conditions for general temporal SEMs   The first step is to prove the bivariate identifiability conditions that a general temporal SEM (Eq. (2)) should satisfy (refer to Theorem 3 in Appendix B.1). In a nutshell, we prove that the functional causal model is bivariate identifiable if (1) the model for the initial conditions is identifiable, and (2) the model is identifiable w.r.t. the instantaneous parents. Remarkably, condition (2) implies that we only need to pay attention to the instantaneous parents for identifiability, which opens the door to flexible lagged-parent dependency. This theorem assumes the causal Markov, minimality, sufficiency and proper-density assumptions.
Identifiability of the history-dependent post non-linear model   Next, we propose a novel generalization of Rhino, called history-dependent PNL. Theorem 4 and Corollary 4.1 in Appendix B.2 prove that it is bivariate identifiable w.r.t. the instantaneous parents (i.e. it satisfies the conditions of Theorem 3) under the additional assumptions (1), (2) and (3) of Theorem 1. The history-dependent PNL is defined as
$$X_t^i = \nu_{it}\!\left(f_{it}\!\left(\mathrm{Pa}_G^i(<t), \mathrm{Pa}_G^i(t)\right) + g_{it}\!\left(\mathrm{Pa}_G^i(<t), \epsilon_{it}\right),\; \mathrm{Pa}_G^i(<t)\right),$$
where $\nu$ is invertible w.r.t. its first argument. The bivariate identifiability of Rhino follows directly, since Rhino is the special case where $\nu$ is the identity mapping.
Generalization to the multivariate case   Finally, inspired by Peters et al. (2012), we prove that the above bivariate identifiability can be generalized to the multivariate case. Refer to Theorem 5 in Appendix B.3 for details.