Towards Out-of-Distribution Sequential Event
Prediction: A Causal Treatment
Chenxiao Yang1, Qitian Wu1, Qingsong Wen2, Zhiqiang Zhou2, Liang Sun2, Junchi Yan1
1Department of Computer Science and Engineering, Shanghai Jiao Tong University
2DAMO Academy, Alibaba Group
{chr26195,echo740,yanjunchi}@sjtu.edu.cn,
{qingsong.wen,zhouzhiqiang.zzq,liang.sun}@alibaba-inc.com
Abstract
The goal of sequential event prediction is to estimate the next event based on a
sequence of historical events, with applications to sequential recommendation, user
behavior analysis and clinical treatment. In practice, next-event prediction
models are trained with sequential data collected at one time and need to generalize
to newly arrived sequences in the remote future, which requires models to handle
temporal distribution shift from training to testing. In this paper, we first take a
data-generating perspective to reveal a negative result: existing approaches
based on maximum likelihood estimation fail under distribution shift due to the
latent context confounder, i.e., the common cause of the historical events and the
next event. We then devise a new learning objective based on backdoor adjustment
and further harness variational inference to make it tractable for sequence
learning problems. On top of that, we propose a framework with hierarchical
branching structures for learning context-specific representations. Comprehensive
experiments on diverse tasks (e.g., sequential recommendation) demonstrate the
effectiveness, applicability and scalability of our method with various off-the-shelf
models as backbones.
1 Introduction
Real-world problem scenarios are flooded with sequential event data consisting of chronologically
arrived events that reflect certain behaviors, activities or responses. A typical example is user
activity prediction [56; 12; 62; 20; 44; 64], which aims to harness a user's recent activities to estimate
future ones and can help downstream targeted advertisement on online platforms (e.g., e-commerce
or social networks). Predicting future events also plays a central role in real situations such as
clinical treatment [3] for promoting social welfare.
A common characteristic of (sequential) event prediction lies in the different time intervals within which
training and testing data are generated. Namely, models trained with data collected at one time
are supposed to predict the next event in the future [59; 36; 41], where the underlying data-generating
distributions may have varied due to environmental changes. However, most existing
approaches [20; 14; 37; 24; 44; 39; 19; 18] overlook this issue in both problem formulation and
empirical evaluation, which may leave the real problems under-resolved with model mis-specification
and result in over-estimation of model performance on real data.
As a concrete example in sequential recommenders [11], the data-generating distribution for user
clicking behaviors over items is highly dependent on user preferences, which are normally correlated
with external factors such as dynamic fashion trends in different years [13] or seasonal
influences on item popularity [43]. These highly time-sensitive external factors induce distinct
user preferences, leading to different behavioral data as time goes by. This issue could also partially
explain the common phenomenon of model performance drop after adaptation from offline to online
environments [41; 36; 22].

Figure 1: A toy example in recommendation. Prior art would spuriously correlate non-causal items
('ice cream' and 'beach T-shirt') and produce undesired results under a new environment in the
future. Our approach endeavors to alleviate the issue by counteracting the effects of contexts (seasons).

The SJTU authors are also with MoE Key Lab of Artificial Intelligence, SJTU. Junchi Yan is the
corresponding author, who is also with Shanghai AI Laboratory.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.13005v2 [cs.LG] 15 Jan 2023
Handling distribution shift in sequential event data poses several non-trivial challenges. First, the
temporal shift requires models to be capable of out-of-distribution (OoD) generalization [30], i.e.,
extrapolating from training environments to new unseen environments in the remote future. Prior art
that focuses on model learning and evaluation over in-distribution data may yield sub-optimal results
on OoD testing instances. Second, as mentioned before, there exist external factors that impact the
generation of events. These external factors, which we term contexts, are unobserved in practice.
To eliminate their effects, one may also need the latent distributions that characterize how the contexts
affect event generation. Unfortunately, such information is often inaccessible due to constrained
data collection, which requires models to learn from observed sequences alone.
1.1 Our Contributions
To resolve these difficulties, in this paper, we adopt a generative perspective to investigate the
temporal distribution shift problem and propose a new variational context adjustment approach with
instantiations to solve the issue for sequential event prediction.
A generative perspective on temporal distribution shift.
We use proof-of-concept Structural Causal Models (SCMs) to characterize the dependencies among
contexts, historical sequences and labels (i.e., next event types) in terms of both data generation
and model prediction. We show that contexts essentially act as a confounder, which leads the model
to leverage spurious correlations and fail to generalize to data from new distributions. See Fig. 1
for a toy example illustrating the issue.
Variational context adjustment.
We propose a variational context adjustment approach to resolve the issue. The resulting new
learning objective has two-fold effects: 1) helping to uncover the underlying relations between
latent contexts and sequences in a data-driven way, and 2) canceling out the confounding effect of
contexts so that the model can explore the true causality of interest (as illustrated in Fig. 1). We
also propose a mixture of variational posteriors to approximate the context prior via randomly
simulating pseudo input sequences.
Instantiation by a flexible framework.
We propose a framework named CaseQ to instantiate the ingredients of the objective, which can be
combined with most off-the-shelf sequence backbone models. To accommodate temporal patterns under
different environments, we devise a novel hierarchical branching structure for learning
context-specific representations of sequences; it can dynamically evolve its architecture to adapt
to variable contexts.
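To give a rough flavor of what a branching structure over a shared backbone could look like, here is a minimal toy sketch in PyTorch. The module name, branch count, and soft gating scheme are our illustrative assumptions, not the actual CaseQ architecture (which the paper specifies later):

```python
import torch
import torch.nn as nn

class ContextBranchingLayer(nn.Module):
    """Toy illustration (not the actual CaseQ module): several
    context-specific branches over a shared sequence encoding,
    softly selected per sequence by an inferred branch distribution."""

    def __init__(self, hidden_dim: int, num_branches: int = 4):
        super().__init__()
        # One small transformation per latent context branch.
        self.branches = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_branches)]
        )
        # Gate that infers a soft branch assignment from the input encoding.
        self.gate = nn.Linear(hidden_dim, num_branches)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) representation from any backbone model.
        weights = torch.softmax(self.gate(h), dim=-1)          # (batch, K)
        outs = torch.stack([b(h) for b in self.branches], 1)   # (batch, K, hidden)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)       # (batch, hidden)
```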
Empirical results.
We carry out comprehensive experiments on three sequential event prediction tasks with evaluation
protocols designed for testing model performance under temporal distribution shift. Specifically,
when we enlarge the time gap between training and testing data, CaseQ alleviates the performance
drop by 47.77% w.r.t. Normalized Discounted Cumulative Gain (NDCG) and 35.73% w.r.t. Hit Ratio
(HR) for sequential recommendation, which shows its robustness against temporal distribution shift.
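For reference, both metrics are standard ranking measures computed from the position of the ground-truth next event among the model's scored candidates; below is a minimal sketch (our own illustrative implementation, not code from the paper) of HR@k and NDCG@k for a single prediction:

```python
import numpy as np

def hit_ratio_at_k(rank: int, k: int) -> float:
    """HR@k: 1 if the ground-truth event is ranked within the top-k, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    """NDCG@k with a single relevant item: 1/log2(rank+1) if it is in the top-k."""
    return 1.0 / np.log2(rank + 1) if rank <= k else 0.0

# Example: model scores over M candidate events, ground-truth index y.
scores = np.array([0.10, 0.70, 0.05, 0.15])
y = 1
rank = int((scores > scores[y]).sum()) + 1   # 1-based rank of the true event
print(hit_ratio_at_k(rank, k=3), ndcg_at_k(rank, k=3))
```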
1.2 Related Works
Sequential Event Prediction [25] aims to predict the next event given a historical event sequence,
with applications including item recommendation on online commercial platforms, user behavior
prediction in social networks, and symptom prediction for clinical treatment. Early works [39; 19; 18]
rely on Markov chains and Bayesian networks to learn event associations. Later, deep learning based
methods were proposed to capture non-linear temporal dynamics using CNNs [45], RNNs [20; 27; 54],
attention models [24; 50; 51], etc. These models are usually trained offline using event sequences
collected during a certain period of time. Due to temporal distribution shift, they may not generalize
well to online environments [41; 36; 22]. Our work develops a principled approach for enhancing
sequential event prediction with better robustness against temporal distribution shift by pursuing
the causality behind the data.
Out-of-Distribution Generalization [30; 4; 31; 63] deals with distinct training and testing
distributions. It is a common yet challenging problem and has received increasing attention due to
its significance [26; 42; 52; 55]. For event prediction, distribution shift inherently exists because
of the different time intervals from which training and testing data are generated. Most existing
sequence learning models assume that data are independent and identically distributed [20; 14; 37].
Despite various OoD generalization approaches designed for machine learning problems in other
research areas (e.g., vision and texts), the problem remains under-explored for sequential event
prediction and sequential recommender systems.
Causal Inference [33; 35] is a fundamental way to identify causal relations among variables and to
pursue stable and robust learning and inference. It has received wide attention and has been applied
in various domains, e.g., computer vision [57; 28], natural language processing [40; 32] and
recommender systems [58; 1]. Some existing causal frameworks for user behavior modeling
[38; 16; 60; 61; 17] aim to extract causal relations based on observed or predefined patterns, yet
they often require domain knowledge or side information for guidance and do not consider temporal
distribution shift. [1; 2; 6; 48] adopt counterfactual learning to overcome the effect of ad-hoc
biases in recommendation tasks (e.g., exposure bias, popularity bias) or to mitigate the clickbait
issue, whereas they do not focus on modeling sequential events and still use MLE as the learning
objective.
2 Problem and Model Formulation
We denote $\mathcal{X} = \{1, 2, \cdots, M\}$ as the space of event types and assume each event is
assigned an event type $x_i \in \mathcal{X}$. Events occurring in chronological order form a sequence
$S = \{x_1, x_2, \cdots, x_{|S|}\}$. As mentioned before, the data distribution is normally affected
by time-dependent external factors, i.e., the context $c$. We use $S$, $Y$, $\hat{Y}$ and $C$ to
denote the random variables of the historical event sequence $S$, the ground-truth next event $y$,
the predicted next event $\hat{y}$ and the context $c$, respectively. The data distribution can be
characterized as $P(S, Y \mid C) = P(S \mid C)\, P(Y \mid S, C)$.
Problem Formulation.
Given training data $\{(S_i, y_i)\}_{i=1}^{N}$ generated from data distributions with
$(S_i, y_i) \sim P(S, Y \mid C = c_{tr}^{(i)})$, where $c_{tr}^{(i)}$ denotes the specific context
when the $i$-th training sample is generated, we aim to learn a prediction model
$\hat{y}_i = f(S_i; \theta)$ that can generalize to testing data $\{(S_j, y_j)\}_{j=1}^{N'}$ from
new distributions with $(S_j, y_j) \sim P(S, Y \mid C = c_{te}^{(j)})$, where $c_{te}^{(j)}$ denotes
the specific context when the $j$-th testing sample is generated. The distribution shift stems from
the different contexts that change over time, which we call temporal distribution shift in this paper.
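To make the setting concrete, the toy simulation below (the specific distributions are our own illustrative assumptions, not the paper's datasets) draws training pairs under one context and testing pairs under a drifted context, instantiating the factorization $P(S, Y \mid C) = P(S \mid C)\, P(Y \mid S, C)$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5  # number of event types

def sample_pair(c: float, length: int = 6):
    """Draw (S, y) from a toy P(S, Y | C=c): the context skews which event
    types are popular, and the next event depends on both the last
    historical event and the context."""
    logits = np.arange(M) * c                       # context-dependent popularity
    probs = np.exp(logits) / np.exp(logits).sum()
    S = rng.choice(M, size=length, p=probs).tolist()          # S ~ P(S | C)
    # P(Y | S, C): mixture of "repeat the last event" and context popularity
    y_probs = 0.5 * np.eye(M)[S[-1]] + 0.5 * probs
    y = rng.choice(M, p=y_probs)
    return S, int(y)

# Temporal distribution shift: training contexts differ from testing ones.
train = [sample_pair(c=0.5) for _ in range(1000)]    # collected at one time
test  = [sample_pair(c=-0.5) for _ in range(1000)]   # arrives later, shifted
```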
2.1 Understanding the Limitations of Maximum Likelihood Estimation
Most existing approaches target maximizing the likelihood $P_\theta(y \mid S)$ as the objective for
optimization. Here we use $P_\theta(\cdot)$ to denote the distribution induced by the prediction
model $f_\theta$. Based on the definitions, we can build two Structural Causal Models (SCMs) that
interpret the causal relations among 1) $C$, $S$, $Y$ (given by the data-generating process) and
2) $C$, $S$ and $\hat{Y}$ (given by model learning), as shown in Fig. 2(a) and (b).

For Fig. 2(a), the three causal relations are given by the definitions for data generation, i.e.,
$P(S, Y \mid C) = P(S \mid C)\, P(Y \mid S, C)$. We next illustrate the rationales behind the other
two causal relations $S \rightarrow \hat{Y}$ and $C \rightarrow \hat{Y}$.
Figure 2: Structural causal models for sequence learning: (a) real-world data generation, (b) the
traditional model, and (c) our interventional model, which deconfounds via variational inference.
$S \rightarrow \hat{Y}$: This relation is induced by the prediction model $\hat{y} = f(S; \theta)$,
which takes a historical event sequence $S$ as input and outputs the prediction for the next event
$\hat{y}$. The relation from $S$ to $\hat{y}$ is deterministic given fixed model parameters $\theta$.
$C \rightarrow \hat{Y}$: This relation is implicitly embodied in the learning process that optimizes
the model parameters on a given dataset collected at one time. By our definition, the training
dataset is generated from a latent distribution affected by the context $c_{tr}$, and the MLE
algorithm yields

$$\theta^{*} = \arg\min_{\theta} \ \mathbb{E}_{(S, y) \sim P(S, Y \mid C = c_{tr})} \big[ \ell(f(S; \theta), y) \big], \tag{1}$$

where $\ell(\cdot, \cdot)$ denotes a certain loss function, e.g., cross-entropy. This indicates that
the learned model parameters $\theta^{*}$ depend on the distribution of $c_{tr}$. Also, due to the
fact $\hat{y} = f(S; \theta^{*})$, we conclude the proof of the relation from $C$ to $\hat{Y}$. In
short, the intuitive explanation for this causal relation lies in two facts: 1) $C$ affects the
generation of the data used for model training, and 2) $\hat{Y}$ is the output of the trained model
given the input sequence $S$.
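Concretely, Eq. (1) corresponds to the familiar cross-entropy training loop used by most sequence models; below is a minimal sketch (the model and data loader are hypothetical placeholders) in which the learned parameters inherit their dependence on $c_{tr}$ through the training data:

```python
import torch
import torch.nn as nn

def train_mle(model: nn.Module, loader, num_epochs: int = 10, lr: float = 1e-3):
    """Standard MLE / cross-entropy training as in Eq. (1).
    `model` maps a batch of event sequences to logits over the M event types;
    `loader` yields (S, y) pairs drawn from P(S, Y | C = c_tr)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()           # l(f(S; theta), y)
    for _ in range(num_epochs):
        for seqs, labels in loader:
            logits = model(seqs)              # f(S; theta), shape (batch, M)
            loss = loss_fn(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model  # theta* depends on c_tr through the training distribution
```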
Confounding Effect of C.
The key observation is that $C$ acts as the confounder in both Fig. 2(a) and (b), which plays a
crucial role in the undesirable testing performance of MLE-based approaches once the distribution is
shifted. As implied by the causal relations in Fig. 2(a), there exists partial information in $S$
that is predictive of $Y$ yet highly sensitive to $C$, a usually fast-changing variable in
real-world scenarios. As a result, the correlation between $S$ and $Y$ in previous contexts may
become spurious in future ones. This also explains the failure of MLE-based models to generalize to
data from new distributions, according to the similar causal pattern in Fig. 2(b).

We can reuse the toy example in Fig. 1 as an intuitive interpretation of this failure. The 'summer'
season (a context) acts as a confounder that correlates buying 'ice cream' (a historical event) with
buying 'T-shirt' (a label), between which there is no obvious causal relation. However, the model
would 'memorize' their correlation and tend to improperly recommend 'T-shirt' in the 'winter' season
(a new context), whereas in fact a user who purchases ice cream in winter most likely does so because
he/she is a dessert lover, in which case recommending other desserts would be a better decision.
Intervention.
To address the confounding effect of $C$ and endow the model with robustness to temporal
distribution shift, we propose to target model learning at $P_\theta(Y \mid do(S))$ instead of the
conventional $P_\theta(Y \mid S)$. As shown in Fig. 2(c), the $do$-operator cuts off the arrow
(i.e., causal relation) from $C$ to $S$, which essentially simulates an ideal data-generating
process where sequences are generated independently of contexts. This operation blocks the backdoor
path $S \leftarrow C \rightarrow \hat{Y}$ that spuriously correlates $S$ and $Y$, and enables the
model to learn the desired causal relation $S \rightarrow Y$, which is invariant to environmental
change.
2.2 Variational Context Adjustment
An ideal way to compute $P_\theta(Y \mid do(S))$ is to carry out a randomized controlled trial
(RCT) [34] by recollecting data from a prohibitively large quantity of random samples under any
possible context, which is infeasible since we could neither control the environment nor collect
data in the future. Fortunately, there exists a statistical estimation of $P_\theta(Y \mid do(S))$
by leveraging backdoor adjustment [34], wherein the confounder $C$ is stratified into discrete
pieces $\mathcal{C} = \{c_i\}_{i=1}^{|\mathcal{C}|}$. By using the backdoor adjustment formula,
$P_\theta(Y \mid do(S)) = \sum_{i=1}^{|\mathcal{C}|} P_\theta(Y \mid S, c_i)\, P(c_i)$.
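A minimal sketch of this marginalization follows; the context-conditional predictor and the discrete context representations below are hypothetical placeholders, not the paper's CaseQ implementation:

```python
import torch

@torch.no_grad()
def backdoor_predict(model, seqs, contexts, prior):
    """Estimate P(Y | do(S)) = sum_i P(Y | S, c_i) * P(c_i).

    model(seqs, c) -> (batch, M) probabilities P(Y | S, c)  [hypothetical API]
    contexts: (K, d) tensor of stratified context representations c_i
    prior:    (K,) tensor of prior weights P(c_i), summing to 1
    """
    out = None
    for c, p_c in zip(contexts, prior):
        term = p_c * model(seqs, c)           # weight each stratum by P(c_i)
        out = term if out is None else out + term
    return out                                 # (batch, M), rows sum to 1
```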