1.2 Related Works
Sequential Event Prediction [25] aims to predict the next event given a historical event sequence, with applications including item recommendation on online commercial platforms, user behavior prediction in social networks, and symptom prediction for clinical treatment. Early works [39; 19; 18] rely on Markov chains and Bayesian networks to learn event associations. Later, deep learning based methods were proposed to capture non-linear temporal dynamics using CNNs [45], RNNs [20; 27; 54], attention models [24; 50; 51], etc. These models are usually trained offline on event sequences collected during a certain period of time, and due to temporal distribution shifts they may not generalize well to online environments [41; 36; 22]. Our work develops a principled approach for enhancing sequential event prediction with better robustness against temporal distribution shifts by pursuing the causality behind the data.
Out-of-Distribution Generalization [30; 4; 31; 63] deals with distinct training and testing distributions. It is a common yet challenging problem and has received increasing attention due to its significance [26; 42; 52; 55]. For event prediction, distribution shift inherently exists because the training and testing data are generated from different time intervals. Most existing sequence learning models assume that data are independent and identically distributed [20; 14; 37]. Despite various OoD generalization approaches designed for machine learning problems in other research areas (e.g., vision and text), the problem remains under-explored for sequential event prediction and sequential recommender systems.
Causal Inference [33; 35] is a fundamental way to identify causal relations among variables and to pursue stable and robust learning and inference. It has received wide attention and has been applied in various domains, e.g., computer vision [57; 28], natural language processing [40; 32], and recommender systems [58; 1]. Some existing causal frameworks for user behavior modeling [38; 16; 60; 61; 17] aim to extract causal relations based on observed or predefined patterns, yet they often require domain knowledge or side information for guidance and do not consider temporal distribution shift. [1; 2; 6; 48] adopt counterfactual learning to overcome the effect of an ad-hoc bias in recommendation tasks (e.g., exposure bias, popularity bias) or to mitigate the clickbait issue, whereas they do not focus on modeling sequential events and still rely on MLE as the learning objective.
2 Problem and Model Formulation
We denote $\mathcal{X} = \{1, 2, \cdots, M\}$ as the space of event types, and assume each event is assigned an event type $x_i \in \mathcal{X}$. Events occurring in chronological order form a sequence $S = \{x_1, x_2, \cdots, x_{|S|}\}$. As mentioned before, the data distribution is normally affected by time-dependent external factors, i.e., the context $c$. We use $S$, $Y$, $\hat{Y}$ and $C$ to denote the random variables of the historical event sequence $S$, the ground-truth next event $y$, the predicted next event $\hat{y}$, and the context $c$, respectively. The data distribution can be characterized as $P(S, Y | C) = P(S | C) P(Y | S, C)$.
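To make this factorization concrete, the following minimal sketch samples a pair $(S, y)$ from $P(S, Y | C) = P(S | C) P(Y | S, C)$ for a given context. The generative form in `event_probs` and the helper `sample_pair` are toy assumptions for illustration only; the paper does not prescribe a particular generative model.

```python
# Toy sketch of the data-generating process (illustrative assumptions only).
import numpy as np

M = 5                                   # number of event types, X = {1, ..., M}
rng = np.random.default_rng(0)

def event_probs(context, history):
    # Toy conditional distribution over the next event type given the history
    # and a time-dependent context vector of length M.
    logits = np.array(context, dtype=float)
    if history:
        logits[history[-1] % M] += 1.0  # simple sequential dependence
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_pair(context, seq_len=6):
    # Draw (S, y) ~ P(S, Y | C = context) = P(S | C) * P(Y | S, C).
    S = []
    for _ in range(seq_len):
        S.append(int(rng.choice(M, p=event_probs(context, S))) + 1)
    y = int(rng.choice(M, p=event_probs(context, S))) + 1
    return S, y

S, y = sample_pair(context=rng.normal(size=M))
print(S, "->", y)
```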
Problem Formulation. Given training data $\{(S_i, y_i)\}_{i=1}^{N}$ generated from data distributions with $(S_i, y_i) \sim P(S, Y | C = c_{tr}^{(i)})$, where $c_{tr}^{(i)}$ denotes the specific context when the $i$-th training sample is generated, we aim to learn a prediction model $\hat{y}_i = f(S_i; \theta)$ that can generalize to testing data $\{(S_j, y_j)\}_{j=1}^{N'}$ from new distributions with $(S_j, y_j) \sim P(S, Y | C = c_{te}^{(j)})$, where $c_{te}^{(j)}$ denotes the specific context when the $j$-th testing sample is generated. The distribution shift stems from the contexts that change over time, which we call temporal distribution shift in this paper.
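As a concrete illustration of this setup, the sketch below splits data chronologically so that training pairs come from earlier contexts and testing pairs from later ones, with a placeholder predictor $f(S; \theta)$. The names `Sample`, `temporal_split`, and the naive baseline are hypothetical and purely illustrative.

```python
# Sketch of the temporal train/test split assumed in the problem formulation.
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    time: float          # generation time, a proxy for the latent context c
    S: List[int]         # historical event sequence, event types in {1, ..., M}
    y: int               # ground-truth next event

def temporal_split(data: List[Sample], cutoff: float):
    # Training data come from earlier contexts (c_tr); testing data come from
    # later, previously unseen contexts (c_te) -- the temporal distribution shift.
    train = [d for d in data if d.time < cutoff]
    test = [d for d in data if d.time >= cutoff]
    return train, test

def f(S: List[int], theta=None) -> int:
    # Placeholder for the prediction model hat{y} = f(S; theta); a real model
    # would be fitted on the training split only.
    return S[-1]  # naive "repeat the last event" baseline

data = [Sample(float(t), [1 + t % 3, 1 + (t + 1) % 3], 1 + (t + 2) % 3) for t in range(10)]
train, test = temporal_split(data, cutoff=7.0)
acc = sum(f(d.S) == d.y for d in test) / len(test)
print(f"naive baseline accuracy on later contexts: {acc:.2f}")
```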
2.1 Understanding the Limitations of Maximum Likelihood Estimation
Most existing approaches target maximizing the likelihood $P_\theta(y | S)$ as the optimization objective. Here we use $P_\theta(\cdot)$ to denote the distribution induced by the prediction model $f_\theta$. Based on these definitions, we can build two Structural Causal Models (SCMs) that interpret the causal relations among 1) $C$, $S$, $Y$ (given by the data-generating process) and 2) $C$, $S$ and $\hat{Y}$ (given by model learning), as shown in Fig. 2(a) and (b).
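For concreteness, the following minimal sketch shows the conventional MLE training step: a GRU-based next-event predictor optimized with cross-entropy, which is equivalent to maximizing $\log P_\theta(y | S)$ on a training batch. The architecture, shapes, and hyperparameters are illustrative assumptions, not the specific models cited above.

```python
# Sketch of MLE training for next-event prediction (illustrative assumptions).
import torch
import torch.nn as nn

M, d = 5, 32                                   # event types, embedding size

class NextEventModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(M + 1, d, padding_idx=0)  # id 0 reserved for padding
        self.gru = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, M)             # logits over event types 1..M

    def forward(self, S):                      # S: (batch, seq_len) event ids
        h, _ = self.gru(self.embed(S))
        return self.out(h[:, -1, :])           # predict from the last hidden state

model = NextEventModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: event-id sequences in {1, ..., M}; labels are 0-indexed classes
# (event type m corresponds to class m - 1).
S = torch.randint(1, M + 1, (8, 6))
y = torch.randint(1, M + 1, (8,)) - 1

opt.zero_grad()
logits = model(S)
loss = nn.functional.cross_entropy(logits, y)  # = -mean log P_theta(y | S)
loss.backward()
opt.step()
```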
For Fig. 2(a), the three causal relations are given by the definitions for data generation, i.e., $P(S, Y | C) = P(S | C) P(Y | S, C)$. We next illustrate the rationales behind the other two causal relations, $S \rightarrow \hat{Y}$ and $C \rightarrow \hat{Y}$.