1.2 Related Works
Sequential Event Prediction [25] aims to predict the next event given a historical event sequence, with applications including item recommendation on online commercial platforms, user behavior prediction in social networks, and symptom prediction for clinical treatment. Early works [39; 19; 18] rely on Markov chains and Bayesian networks to learn event associations. Later, deep learning based methods were proposed to capture non-linear temporal dynamics using CNNs [45], RNNs [20; 27; 54], attention models [24; 50; 51], etc. These models are usually trained offline on event sequences collected during a certain period of time, and due to temporal distribution shifts they may not generalize well to online environments [41; 36; 22]. Our work develops a principled approach for enhancing sequential event prediction with better robustness against temporal distribution shifts by pursuing the causality behind the data.
Out-of-Distribution Generalization [30; 4; 31; 63] deals with distinct training and testing distributions. It is a common yet challenging problem and has received increasing attention due to its significance [26; 42; 52; 55]. For event prediction, distribution shift inherently exists because the training and testing data are generated from different time intervals. Most existing sequence learning models assume that data are independent and identically distributed [20; 14; 37]. Despite various OoD generalization approaches designed for machine learning problems in other research areas (e.g., vision and text), the problem remains under-explored for sequential event prediction and sequential recommender systems.
Causal Inference [33; 35] is a fundamental way to identify causal relations among variables and to pursue stable and robust learning and inference. It has received wide attention and has been applied in various domains, e.g., computer vision [57; 28], natural language processing [40; 32], and recommender systems [58; 1]. Some existing causal frameworks for user behavior modeling [38; 16; 60; 61; 17] aim to extract causal relations based on observed or predefined patterns, yet they often require domain knowledge or side information for guidance and do not consider temporal distribution shift. [1; 2; 6; 48] adopt counterfactual learning to overcome the effect of an ad-hoc bias in recommendation tasks (e.g., exposure bias, popularity bias) or to mitigate the clickbait issue, whereas they do not focus on modeling sequential events and still rely on MLE as the learning objective.
2 Problem and Model Formulation
We denote $\mathcal{X} = \{1, 2, \cdots, M\}$ as the space of event types, and assume each event is assigned an event type $x_i \in \mathcal{X}$. Events occurring in chronological order form a sequence $S = \{x_1, x_2, \cdots, x_{|S|}\}$. As mentioned before, the data distribution is normally affected by time-dependent external factors, i.e., the context $c$. We use $S$, $Y$, $\hat{Y}$ and $C$ to denote the random variables of the historical event sequence $S$, the ground-truth next event $y$, the predicted next event $\hat{y}$, and the context $c$, respectively. The data distribution can be characterized as $P(S, Y | C) = P(S | C) P(Y | S, C)$.
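To make this factorization concrete, the following minimal sketch samples a pair $(S, y)$ from $P(S, Y | C) = P(S | C) P(Y | S, C)$ for a given context. The generative form in `event_probs` and the helper `sample_pair` are toy assumptions for illustration only; the paper does not prescribe a particular generative model.

```python
# Toy sketch of the data-generating process (illustrative assumptions only).
import numpy as np

M = 5                                   # number of event types, X = {1, ..., M}
rng = np.random.default_rng(0)

def event_probs(context, history):
    # Toy conditional distribution over the next event type given the history
    # and a time-dependent context vector of length M.
    logits = np.array(context, dtype=float)
    if history:
        logits[history[-1] % M] += 1.0  # simple sequential dependence
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_pair(context, seq_len=6):
    # Draw (S, y) ~ P(S, Y | C = context) = P(S | C) * P(Y | S, C).
    S = []
    for _ in range(seq_len):
        S.append(int(rng.choice(M, p=event_probs(context, S))) + 1)
    y = int(rng.choice(M, p=event_probs(context, S))) + 1
    return S, y

S, y = sample_pair(context=rng.normal(size=M))
print(S, "->", y)
```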
Problem Formulation. Given training data $\{(S_i, y_i)\}_{i=1}^{N}$ generated from data distributions with $(S_i, y_i) \sim P(S, Y | C = c_{tr}^{(i)})$, where $c_{tr}^{(i)}$ denotes the specific context when the $i$-th training sample is generated, we aim to learn a prediction model $\hat{y}_i = f(S_i; \theta)$ that can generalize to testing data $\{(S_j, y_j)\}_{j=1}^{N'}$ from new distributions with $(S_j, y_j) \sim P(S, Y | C = c_{te}^{(j)})$, where $c_{te}^{(j)}$ denotes the specific context when the $j$-th testing sample is generated. The distribution shift stems from the contexts that change over time, which we call temporal distribution shift in this paper.
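As a concrete illustration of this setup, the sketch below splits data chronologically so that training pairs come from earlier contexts and testing pairs from later ones, with a placeholder predictor $f(S; \theta)$. The names `Sample`, `temporal_split`, and the naive baseline are hypothetical and purely illustrative.

```python
# Sketch of the temporal train/test split assumed in the problem formulation.
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    time: float          # generation time, a proxy for the latent context c
    S: List[int]         # historical event sequence, event types in {1, ..., M}
    y: int               # ground-truth next event

def temporal_split(data: List[Sample], cutoff: float):
    # Training data come from earlier contexts (c_tr); testing data come from
    # later, previously unseen contexts (c_te) -- the temporal distribution shift.
    train = [d for d in data if d.time < cutoff]
    test = [d for d in data if d.time >= cutoff]
    return train, test

def f(S: List[int], theta=None) -> int:
    # Placeholder for the prediction model hat{y} = f(S; theta); a real model
    # would be fitted on the training split only.
    return S[-1]  # naive "repeat the last event" baseline

data = [Sample(float(t), [1 + t % 3, 1 + (t + 1) % 3], 1 + (t + 2) % 3) for t in range(10)]
train, test = temporal_split(data, cutoff=7.0)
acc = sum(f(d.S) == d.y for d in test) / len(test)
print(f"naive baseline accuracy on later contexts: {acc:.2f}")
```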
2.1 Understanding the Limitations of Maximum Likelihood Estimation
Most existing approaches target maximizing the likelihood $P_\theta(y | S)$ as the optimization objective. Here we use $P_\theta(\cdot)$ to denote the distribution induced by the prediction model $f_\theta$. Based on these definitions, we can build two Structural Causal Models (SCMs) that interpret the causal relations among 1) $C$, $S$, $Y$ (given by the data-generating process) and 2) $C$, $S$ and $\hat{Y}$ (given by model learning), as shown in Fig. 2(a) and (b).
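For concreteness, the following minimal sketch shows the conventional MLE training step: a GRU-based next-event predictor optimized with cross-entropy, which is equivalent to maximizing $\log P_\theta(y | S)$ on a training batch. The architecture, shapes, and hyperparameters are illustrative assumptions, not the specific models cited above.

```python
# Sketch of MLE training for next-event prediction (illustrative assumptions).
import torch
import torch.nn as nn

M, d = 5, 32                                   # event types, embedding size

class NextEventModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(M + 1, d, padding_idx=0)  # id 0 reserved for padding
        self.gru = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, M)             # logits over event types 1..M

    def forward(self, S):                      # S: (batch, seq_len) event ids
        h, _ = self.gru(self.embed(S))
        return self.out(h[:, -1, :])           # predict from the last hidden state

model = NextEventModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: event-id sequences in {1, ..., M}; labels are 0-indexed classes
# (event type m corresponds to class m - 1).
S = torch.randint(1, M + 1, (8, 6))
y = torch.randint(1, M + 1, (8,)) - 1

opt.zero_grad()
logits = model(S)
loss = nn.functional.cross_entropy(logits, y)  # = -mean log P_theta(y | S)
loss.backward()
opt.step()
```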
For Fig. 2(a), the three causal relations are given by the definitions for data generation, i.e., $P(S, Y | C) = P(S | C) P(Y | S, C)$. We next illustrate the rationales behind the other two causal relations, $S \rightarrow \hat{Y}$ and $C \rightarrow \hat{Y}$.