
Figure 1: Illustration of common sequential recommendation models.
model (Figure 1a), which predicts the next item a user will interact with based on her historical behaviors [10, 14, 29]. It is a straightforward modeling choice for sequential data. However, in sequential recommendation, the autoregressive schema can weaken the model's expressiveness, because in practice the sequential dependencies of user behaviors may not strictly hold. For example, after purchasing an iPad, a user may click on an Apple Pencil, an iPad case, and headphones, but it is likely that the user clicked on these three products in an arbitrary order. Simply modeling them in a compulsory sequential order loses some overall contextual information, since future data (interactions that occur after the target interaction) also provide rich collaborative information to assist model training. It is therefore reasonable to leverage future data to train better sequential recommendation models.
Recently, researchers have shown that leveraging both past and future contextual information during training significantly boosts recommendation performance compared to autoregressive models [28, 36]. For example, inspired by advances in the field of natural language processing (NLP), BERT4Rec [28] employs a masked language model (MLM) training objective, which predicts masked items based on both historical and future behavior records during training (Figure 1b). BERT4Rec significantly improves recommendation performance compared to its unidirectional autoregressive counterpart SASRec [14].
Despite the richer contextual information brought by training with future interaction data, simply adopting MLM objectives for SR can introduce a severe training-inference gap. Specifically, at training time, the MLM model predicts masked items with both past and future interactions as context, which can be written as $P(i \mid \mathbf{x}_{past}, \mathbf{x}_{future})$. At inference, however, only past behaviors are available for prediction, i.e., $P(i \mid \mathbf{x}_{past}, \mathrm{NULL})$. This discrepancy of context between training and inference can bias the model at inference time and lead to potential performance degradation.
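To make the gap concrete, below is a minimal sketch in PyTorch-style Python (our own illustration; MASK_ID and the helper names are hypothetical, not from the paper) of the two contexts an MLM-style recommender sees: bidirectional at training time, past-only at inference.

```python
import torch

MASK_ID = 0  # hypothetical id reserved for the [mask] token

def mlm_training_input(seq: torch.Tensor, t: int) -> torch.Tensor:
    """Training: mask position t; the model predicts it from both
    x_past = seq[:t] and x_future = seq[t+1:], i.e. P(i | x_past, x_future)."""
    masked = seq.clone()
    masked[t] = MASK_ID
    return masked

def inference_input(seq: torch.Tensor) -> torch.Tensor:
    """Inference: append [mask] after the observed history; only
    x_past is available, i.e. P(i | x_past, NULL)."""
    return torch.cat([seq, torch.tensor([MASK_ID], dtype=seq.dtype)])

history = torch.tensor([12, 7, 33, 5, 19])
print(mlm_training_input(history, t=2))  # tensor([12,  7,  0,  5, 19])
print(inference_input(history))          # tensor([12,  7, 33,  5, 19,  0])
```

The model thus trains on inputs whose masked positions have right-hand context, yet must predict at inference from inputs that never do, which is exactly the distribution shift described above.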
To exploit richer contextual information from the future while alleviating the potential training-inference gap, the following model desiderata should be met:
• Past-future disentanglement: The training-inference gap in existing methods is caused by the use of a single encoder that entangles past and future contextual information, thus interfering with inference. Instead, future data should be modeled separately, without explicitly interfering with the modeling of historical interaction data. If both disentangled encoders are well-trained, the absence of future information will not degrade the performance of the past-information encoder. By this means, we can use only past behaviors for inference, with a minimal gap between training and inference.
• Past-future mutual enhancement: Users' interests captured by past and future behaviors are closely related and complementary. Simply separating the past and future modeling processes hinders leveraging the knowledge each learns from the other. To better exploit future data, an elegant solution is to have the two disentangled modeling processes mutually enhance each other.
In this paper, we propose a framework for better utilization of past and future information in sequential recommendation, named DualRec. To alleviate the training-inference gap, DualRec adopts a dual network structure: for a target interaction, past and future contextual behaviors are modeled by two separate encoders. The two encoders perform dual tasks, i.e., the past encoder performs next-item prediction (primal) while the future encoder performs previous-item prediction (dual). In this way, future information is decoupled from the modeling of past information. During inference, only the past encoder is used to make predictions, thus avoiding the training-inference gap. Second, the dual networks enhance each other through multi-scale knowledge transfer. Specifically, the internal representations of the two networks are constrained to align, based on the assumption that users' interests captured by past and future behaviors are closely related and complementary. Finally, as a general framework, DualRec can be instantiated with different backbone models, including RNNs, Transformers, and filter-based MLPs.
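To ground this description, here is a minimal, hypothetical PyTorch sketch of the dual structure (our own simplification, not the authors' implementation): two encoders trained with next-item and previous-item prediction losses, plus a simple cosine-alignment term standing in for the multi-scale knowledge transfer. The GRU backbone, loss weight, and alignment mechanism are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualRecSketch(nn.Module):
    """Sketch of the dual-network idea: a past encoder (next-item
    prediction, primal) and a future encoder (previous-item prediction, dual)."""

    def __init__(self, n_items: int, d: int = 64):
        super().__init__()
        self.emb = nn.Embedding(n_items, d)
        # The backbone is pluggable (RNN / Transformer / filter-based MLP);
        # GRUs are used here purely for brevity.
        self.past_enc = nn.GRU(d, d, batch_first=True)
        self.future_enc = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, n_items)

    def forward(self, seq: torch.Tensor):
        x = self.emb(seq)                        # (B, L, d)
        h_past, _ = self.past_enc(x)             # reads left-to-right
        h_fut, _ = self.future_enc(x.flip(1))    # reads right-to-left
        return h_past, h_fut.flip(1)             # align both to position t

    def loss(self, seq: torch.Tensor) -> torch.Tensor:
        h_past, h_fut = self(seq)
        # Primal task: past representation at position t predicts item t+1.
        next_logits = self.out(h_past[:, :-1])
        l_next = F.cross_entropy(next_logits.reshape(-1, next_logits.size(-1)),
                                 seq[:, 1:].reshape(-1))
        # Dual task: future representation at position t predicts item t-1.
        prev_logits = self.out(h_fut[:, 1:])
        l_prev = F.cross_entropy(prev_logits.reshape(-1, prev_logits.size(-1)),
                                 seq[:, :-1].reshape(-1))
        # Knowledge transfer (simplified): align the two views of the user,
        # assuming past- and future-derived interests are complementary.
        l_align = 1 - F.cosine_similarity(h_past, h_fut, dim=-1).mean()
        return l_next + l_prev + 0.1 * l_align   # 0.1 is an arbitrary weight

# At inference, predictions rely only on the past encoder:
#   h_past, _ = model(seq); scores = model.out(h_past[:, -1])
```

Because the prediction path touches only the past encoder and the output layer, the future encoder can be dropped at serving time without changing the model's predictions.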
To summarize, our contributions are as follows:
• We highlight the training-inference gap that exists in sequential recommendation models when leveraging future data. To handle this problem, we propose a novel framework, DualRec, that achieves the disentanglement and mutual enhancement of past-future modeling.
• DualRec explicitly decouples the modeling of past and future information into two separate encoders, thus alleviating the training-inference gap, and further employs past-future knowledge transfer to learn enhanced representations.
• We conduct comprehensive experiments on four public datasets. Experimental results demonstrate the effectiveness of our proposed DualRec compared with several baseline models. Further analysis illustrates its compatibility with different backbone models.
2 RELATED WORK
Sequential recommenders are designed to model the sequential dynamics in user behaviors. Early efforts leverage the Markov Chain (MC) assumption [25] and model item-item transitions to predict a user's next action based on the last visited items. Recently, different neural network-based models have been applied, including Recurrent Neural Networks (RNNs) [20], Convolutional Neural Networks (CNNs) [17], Attention Networks [31], and Graph Neural Networks [16]. GRU4Rec [10] is a pioneering work that employs an RNN to capture the dynamic characteristics of user behaviors. Hidasi and Karatzoglou further extend GRU4Rec with enhanced ranking functions as well as effective sampling strategies. Another line of research is based on CNNs. Caser [29] treats the embedding matrix of items as a 2D image and models user behavior sequences with convolution. The main advantage of CNN-based models is that they are much easier to parallelize on GPUs compared with RNN-based