Disentangling Past-Future Modeling in Sequential Recommendation via Dual Networks

Hengyu Zhang∗ (zhang-hy21@mails.tsinghua.edu.cn), Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Enming Yuan∗ (yem19@mails.tsinghua.edu.cn), Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
Wei Guo (guowei67@huawei.com), Huawei Noah's Ark Lab, Shenzhen, China
Zhicheng He (hezhicheng9@huawei.com), Huawei Noah's Ark Lab, Shenzhen, China
Jiarui Qin (qinjr@icloud.com), Shanghai Jiao Tong University, Shanghai, China
Huifeng Guo (huifeng.guo@huawei.com), Huawei Noah's Ark Lab, Shenzhen, China
Bo Chen (chenbo116@huawei.com), Huawei Noah's Ark Lab, Shenzhen, China
Xiu Li (li.xiu@sz.tsinghua.edu.cn), Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Ruiming Tang (tangruiming@huawei.com), Huawei Noah's Ark Lab, Shenzhen, China
ABSTRACT
Sequential recommendation (SR) plays an important role in personalized recommender systems because it captures dynamic and diverse preferences from users' continually growing behavior records. Departing from the standard autoregressive training strategy, future data (which is also available during training) has been used to facilitate model training, as it provides richer signals about a user's current interests and can improve recommendation quality. However, these methods suffer from a severe training-inference gap: both past and future contexts are modeled by the same encoder during training, while only historical behaviors are available during inference. This discrepancy leads to potential performance degradation. To alleviate the training-inference gap, we propose a new framework, DualRec, which achieves past-future disentanglement and past-future mutual enhancement via a novel dual network. Specifically, a dual network structure is exploited to model the past and the future context separately, and a bi-directional knowledge transferring mechanism enhances the knowledge learnt by the dual network. Extensive experiments on four real-world datasets demonstrate the superiority of our approach over baseline methods. Besides, we demonstrate the compatibility of DualRec by instantiating it using RNN, Transformer, and filter-MLP backbones. Further empirical analysis verifies the high utility of modeling future contexts under our DualRec framework. The code of DualRec is publicly available at https://github.com/zhy99426/DualRec.

∗Work done when they were research interns at Huawei Noah's Ark Lab; both authors contributed equally to this research.
†Corresponding author.

This work is licensed under a Creative Commons Attribution International 4.0 License.
CIKM '22, October 17–21, 2022, Atlanta, GA, USA
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9236-5/22/10.
https://doi.org/10.1145/3511808.3557289
CCS CONCEPTS
• Information systems → Recommender systems.
KEYWORDS
Sequential recommendation; training-inference gap; dual network
ACM Reference Format:
Hengyu Zhang, Enming Yuan, Wei Guo, Zhicheng He, Jiarui Qin, Huifeng
Guo, Bo Chen, Xiu Li, and Ruiming Tang. 2022. Disentangling Past-Future
Modeling in Sequential Recommendation via Dual Networks. In Proceedings
of the 31st ACM Int’l Conference on Information and Knowledge Management
(CIKM ’22), Oct. 17–21, 2022, Atlanta, GA, USA. ACM, New York, NY, USA,
11 pages. https://doi.org/10.1145/3511808.3557289
1 INTRODUCTION
Recommender systems have been widely deployed in online service platforms, ranging from online advertising and retailing [3, 7, 32] to music and video recommendation [4, 5, 30]. Generally, users' interests are dynamic and evolve over time, and this evolution is reflected in users' sequential interactions. This motivates Sequential Recommendation (SR), which models the sequential characteristics of users' behaviors to provide more precise and customized services. To make accurate predictions, it is essential to learn effective representations for users based on the historical interactions they have engaged with. Over the years, great efforts have been devoted to this problem, and different model architectures have been proposed for sequential recommendation, including Recurrent Neural Networks (RNNs) [10], Convolutional Neural Networks (CNNs) [29], Self-Attention Networks (SANs) [14], and Graph Neural Networks (GNNs) [23, 33].
Figure 1: Illustration of common sequential recommendation models.

Typically, the sequential recommendation problem is formulated as a next-item prediction problem, also known as the autoregressive model (Figure 1a), which predicts the next item a user will interact with based on her historical behaviors [10, 14, 29]. It is a straightforward modeling choice for sequential data.
However, in sequential recommendation, the autoregressive schema can weaken the model's expressiveness, because in practice the sequential dependencies among user behaviors may not strictly hold. For example, after purchasing an iPad, a user may click on an Apple Pencil, an iPad case, and headphones; it is likely that the user clicked on these three products in an arbitrary order, so forcing a strict sequential order on them loses some of the overall contextual information. Moreover, future data (interactions that occur after the target interaction) also provide rich collaborative information to assist model training. It is therefore reasonable to leverage future data to train better sequential recommendation models.
Recently, researchers have shown that leveraging both past and future contextual information during training can significantly boost recommendation performance compared to autoregressive models [28, 36]. For example, inspired by advances in the field of natural language processing (NLP), BERT4Rec [28] employs a masked language model (MLM) training objective, which predicts masked items based on both historical and future behavior records during training (Figure 1b). BERT4Rec significantly improves recommendation performance compared to its unidirectional autoregressive counterpart SASRec [14].
Despite the richer contextual information brought by training with future interaction data, simply adopting MLM objectives for SR can introduce a severe training-inference gap. Specifically, at training time, the MLM model predicts masked items with both past and future interactions as context, which can be written as $P(i \mid \mathbf{x}_{past}, \mathbf{x}_{future})$. At inference time, however, only past behaviors are available for prediction, i.e., $P(i \mid \mathbf{x}_{past}, \mathrm{NULL})$. This discrepancy of context between training and inference can bias the model during inference and lead to potential performance degradation.
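To make this discrepancy concrete, the following minimal sketch (our illustration, not taken from the paper) contrasts the input an MLM-style model sees during training with the input available at serving time; the item ids and the MASK convention are hypothetical:

```python
import torch

MASK = 1  # hypothetical id of the special [mask] token

# Training (MLM): a masked position sees BOTH past and future items.
seq_train = torch.tensor([12, 45, MASK, 7, 99])
# Context for position 2 is past (12, 45) plus future (7, 99),
# so the model fits P(i | x_past, x_future).

# Inference: the NEXT item is predicted, so no future context exists.
seq_infer = torch.tensor([12, 45, 7, 99, MASK])
# Context is past only, i.e., the model is queried as P(i | x_past, NULL).
```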
To exploit richer contextual information from the future while alleviating the potential training-inference gap, the following model desiderata should be met:
• Past-future disentanglement: The training-inference gap in existing methods is caused by using a single encoder that entangles past and future contextual information, thereby interfering with inference. Instead, future data should be modeled separately, without explicitly interfering with the modeling of historical interaction data. If both disentangled encoders are well trained, the absence of future information will not degrade the performance of the past-information encoder. By this means, we can use only past behaviors for inference, with a minimal gap between training and inference.
• Past-future mutual enhancement: Users' interests captured by past and future behaviors are closely related and complementary. Simply separating the past and future modeling processes prevents each from leveraging the knowledge learned by the other. To better exploit future data, an elegant way is to have the disentangled past-future modeling processes mutually enhance each other.
In this article, we propose a framework for better utilization of past and future information in sequential recommendation, named DualRec. To alleviate the training-inference gap, DualRec adopts a dual network structure: for a target interaction, past and future contextual behaviors are modeled by two separate encoders. The two encoders perform dual tasks, i.e., the past encoder performs next-item prediction (the primal task) while the future encoder performs previous-item prediction (the dual task). In this way, future information is decoupled from the modeling of past information. During inference, only the past encoder is used to make predictions, thus avoiding the training-inference gap. Secondly, the dual networks enhance each other through multi-scale knowledge transferring: the internal representations of the two networks are constrained to align, based on the assumption that users' interests captured by past and future behaviors are closely related and complementary. Finally, as a general framework, DualRec can be instantiated with different backbone models, including RNNs, Transformers, and filter-based MLPs.
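To make the training scheme concrete, here is a minimal sketch of what one DualRec-style training step could look like. The module names (`past_enc`, `future_enc`), the dot-product scorer, and the single-layer MSE alignment term are our simplifying assumptions; the paper's knowledge transferring is multi-scale, and the released code at https://github.com/zhy99426/DualRec is authoritative.

```python
import torch.nn.functional as F

def dualrec_step(past_enc, future_enc, item_emb, x_past, x_future,
                 next_item, prev_item, alpha=0.1):
    h_past = past_enc(x_past)        # [B, d] summary of behaviors before the target
    h_future = future_enc(x_future)  # [B, d] summary of behaviors after the target

    logits_past = h_past @ item_emb.weight.T      # next-item prediction (primal task)
    logits_future = h_future @ item_emb.weight.T  # previous-item prediction (dual task)
    loss = F.cross_entropy(logits_past, next_item) \
         + F.cross_entropy(logits_future, prev_item)

    # Knowledge transfer sketched as a symmetric alignment of the two
    # representations; the paper's mechanism is multi-scale and richer.
    loss = loss + alpha * (F.mse_loss(h_past, h_future.detach())
                           + F.mse_loss(h_future, h_past.detach()))
    return loss
```

At inference time only `past_enc` would be used, so the future encoder and the alignment term add no serving cost.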
To summarize, our contributions are as follows:
• We highlight the training-inference gap that exists in sequential recommendation models when leveraging future data. To handle this problem, we propose a novel framework, DualRec, which achieves the disentanglement and mutual enhancement of past-future modeling.
• DualRec explicitly decouples the modeling of past information and future information into two separate encoders, thus alleviating the training-inference gap, and further uses past-future knowledge transferring to learn enhanced representations.
• We conduct comprehensive experiments on four public datasets. Experimental results demonstrate the effectiveness of our proposed DualRec compared with several baseline models. Further analysis illustrates its compatibility.
2 RELATED WORK
Sequential recommenders are designed to model the sequential dynamics in user behaviors. Early efforts leverage the Markov Chain (MC) assumption [25] and model item-item transitions to predict a user's next action based on the last visited items. Recently, different neural network-based models have been applied, including Recurrent Neural Networks (RNNs) [20], Convolutional Neural Networks (CNNs) [17], Attention Networks [31], and Graph Neural Networks [16]. GRU4Rec [10] is a pioneering work that employs an RNN to capture the dynamic characteristics of user behaviors. Hidasi and Karatzoglou further extend GRU4Rec with enhanced ranking functions as well as effective sampling strategies. Another line of research is based on CNNs. Caser [29] treats the embedding matrix of items as a 2D image and models user behavior sequences with convolution. The main advantage of CNN-based models is that they are much easier to parallelize on GPUs than RNN-based models.
The Self-Attention Mechanism and the Transformer architecture [31] have also been applied to sequential recommendation and proven advantageous in discovering user behavior patterns, thanks to adaptive weight learning and better integration of long-range dependencies; examples include SASRec [14], BERT4Rec [28], and S3-Rec [38]. More recently, Graph Neural Networks have been explored to encode the global structure of user interactions and capture complex item transitions. SRGNN [33] and GCSAN [35] model a sequence or session as a graph structure, and use graph neural networks with attention mechanisms to capture the rich dependencies. Inspired by work on self-supervised learning, S3-Rec [38] and CL4Rec [34] use contrastive learning to pre-train the sequential recommendation model. Furthermore, CLEA [22] uses a contrastive learning approach to automatically filter out items in the user sequence that are irrelevant to the target.

Among these sequential recommendation works, some bidirectional Transformer-based methods adopting MLM objectives, such as BERT4Rec [28], attempt to utilize both past and future contextual information. However, these works suffer from a severe training-inference gap, as analyzed in Section 1.
3 METHOD
In this section, we first define the sequential recommendation problem (Section 3.1). Then, we elaborate on the technical details of our proposed DualRec framework, which is shown in Figure 2(b). We first introduce the base encoder (Section 3.2), a Transformer-based backbone. Then, to utilize the future context while alleviating the training-inference gap, we present the dual network structure (Section 3.3), which models past and future behaviors with two separate encoders. Furthermore, a bi-directional information transferring mechanism (Section 3.4) is adopted to make the dual networks enhance each other. The details of model training and inference are presented in Section 3.5. Finally, we discuss the complexity and compatibility of DualRec (Section 3.6); essentially, DualRec can work in cooperation with most existing sequential recommendation models.
3.1 Problem Formulation
Sequential recommendation learns to predict users' next behavior from their historical behavior sequences. Given a set of users $\mathcal{U} = \{u_1, u_2, \ldots, u_{|\mathcal{U}|}\}$ and a set of items $\mathcal{I} = \{i_1, i_2, \ldots, i_{|\mathcal{I}|}\}$, let $\mathcal{S}^{(u)} = [i^{(u)}_1, i^{(u)}_2, \ldots, i^{(u)}_{T_u}]$ denote the chronologically sorted behavior sequence of user $u \in \mathcal{U}$, where $i^{(u)}_t$ is the $t$-th interacted item of user $u$, and $T_u = |\mathcal{S}^{(u)}|$ is the length of the behavior sequence. A sequential recommendation model predicts the next item $i^{(u)}_{T_u+1}$ that user $u$ will interact with based on the behavior history $\mathcal{S}^{(u)}$, which can be formulated as:

$$p(i^{(u)}_{T_u+1} = i^{(c)} \mid \mathcal{S}^{(u)}) = \mathrm{SeqRecModel}(\mathcal{S}^{(u)}, i^{(c)}), \qquad (1)$$

where $i^{(c)}$ is the candidate item, and $\mathrm{SeqRecModel}(\cdot, \cdot)$ is a sequential recommendation model.
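In practice, Eq. (1) is often realized as a dot product between the encoded sequence representation and the candidate item embeddings, followed by a softmax over the item vocabulary. The sketch below assumes this common choice; the paper does not prescribe a particular scorer in this subsection.

```python
import torch

def next_item_scores(seq_repr: torch.Tensor, item_embeddings: torch.Tensor) -> torch.Tensor:
    """Eq. (1) with dot-product scoring (a common assumption, not fixed by the paper).

    seq_repr:        [B, d]   encoder summary of each user's history S^(u)
    item_embeddings: [|I|, d] one row per candidate item i^(c)
    returns:         [B, |I|] scores; a softmax over the last dim gives p(i | S^(u))
    """
    return seq_repr @ item_embeddings.T
```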
3.2 Base Encoder
For simplicity, we illustrate with the standard Transformer as the base encoder of the dual network structure; it is widely used in sequential recommendation methods [14, 28, 38] and has been proven advantageous in discovering sequential patterns due to its adaptive weight learning. Notably, our proposed framework can also work with other backbone architectures, including RNNs and CNNs. We evaluate the performance of our framework with different backbones in the experimental section.
3.2.1 Embedding Layer. The input user behavior sequence $\mathcal{S}^{(u)}$ is first transformed into a fixed-length sequence $s = (i_1, i_2, \ldots, i_n)$ (for simplicity, we omit the superscript $u$), where $n$ is the predefined maximum length. Specifically, if the original sequence length is greater than $n$, we keep the most recent $n$ actions; if it is less than $n$, special $[\mathrm{padding}]$ tokens are padded to the left of the sequence as dummy past interactions.
Item Embedding: For all items in $\mathcal{I}$, we create an item embedding matrix $\mathbf{E}_I \in \mathbb{R}^{|\mathcal{I}| \times d}$, where $d$ is the embedding size. The user behavior sequence $s = (i_1, i_2, \ldots, i_n)$ is embedded as:

$$\mathbf{X}^{(0)} = (\mathbf{e}_1, \cdots, \mathbf{e}_n), \quad \mathbf{e}_k = \mathrm{LookUp}(i_k, \mathbf{E}_I), \qquad (2)$$

where $\mathrm{LookUp}(\cdot, \cdot)$ retrieves an item embedding from the embedding matrix and always returns a constant zero vector for the padding item.
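A minimal sketch of this layer, assuming PyTorch, id 0 for the $[\mathrm{padding}]$ token, and illustrative sizes (none of these are fixed by the paper):

```python
import torch
import torch.nn as nn

n, d, num_items = 50, 64, 10000  # illustrative sizes

# padding_idx=0 gives a constant zero vector for the [padding] token, matching Eq. (2)
item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)

def to_fixed_length(seq: list, n: int) -> torch.Tensor:
    """Keep the most recent n items; left-pad shorter sequences with the padding id."""
    seq = seq[-n:]
    return torch.tensor([0] * (n - len(seq)) + seq)

s = to_fixed_length([17, 4, 321, 9], n)  # item ids assumed to start at 1
X0 = item_emb(s)                         # [n, d]; padding rows are zero vectors
```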
Relative Positional Embedding: The Transformer contains no recurrent or convolutional operation and is therefore insensitive to positional information, so additional positional embeddings need to be injected. We adopt relative positional embedding, as it is more robust than absolute positional embedding [27]. When calculating the attention weight of the $j$-th item given the $i$-th item as the query, the relative positional embedding is calculated as

$$\mathbf{p}(i, j) = \mathrm{LookUp}(\mathrm{Dist}(i, j), \mathbf{E}_P), \qquad (3)$$

where $\mathrm{Dist}(i, j) \in [-n+1, n-1]$ is the relative distance between the two items, and $\mathbf{E}_P \in \mathbb{R}^{(2n-1) \times h}$ is the positional embedding matrix, with $h$ the number of heads in the self-attention mechanism. The relative positional matrix $\mathbf{p}$ is added to the attention matrix to inject the relative positional information.
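The following sketch shows one way to build the relative positional term of Eq. (3) as a per-head additive bias; treating $\mathbf{E}_P$ as an `nn.Embedding` and the sign convention $\mathrm{Dist}(i, j) = j - i$ are our assumptions:

```python
import torch
import torch.nn as nn

n, h = 50, 2                          # max sequence length, number of heads
E_P = nn.Embedding(2 * n - 1, h)      # one learned scalar per relative distance and head

idx = torch.arange(n)
dist = idx[None, :] - idx[:, None]    # Dist(i, j) = j - i, in [-(n-1), n-1]
p = E_P(dist + n - 1)                 # shift to valid indices; [n, n, h], Eq. (3)
p = p.permute(2, 0, 1)                # [h, n, n], one bias matrix per head

# later: attention_logits = attention_logits + p   (added before the softmax)
```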
3.2.2 Transformer Layer. After the embedding layer, the input $\mathbf{X} \in \mathbb{R}^{n \times d}$ is fed into multi-head self-attention (MSA) blocks to capture the relations among different items in the sequence. The computation of each head can be formulated as:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(\mathbf{X}\mathbf{W}^Q_i) \cdot (\mathbf{X}\mathbf{W}^K_i)^\top}{\sqrt{d/h}}\right)(\mathbf{X}\mathbf{W}^V_i), \qquad (4)$$

where $\mathbf{W}^Q_i$, $\mathbf{W}^K_i$, and $\mathbf{W}^V_i \in \mathbb{R}^{d \times d/h}$ are the query, key, and value projection matrices, respectively, $(\cdot)$ is the matrix multiplication operator, $i$ indicates a specific head, $h$ is the total number of heads, and $\sqrt{d/h}$ is the scaling factor.

MSA performs the above self-attention operation $h$ times in parallel, then concatenates the outputs of all heads and linearly projects them to get the final output:

$$\mathrm{MSA}(\mathbf{X}) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\,\mathbf{W}^O, \qquad (5)$$

where $\mathbf{W}^O \in \mathbb{R}^{d \times d}$ is the corresponding transformation matrix.

To introduce non-linearity and perform feature transformation between MSA layers, a point-wise feed-forward (PFF) layer is used:

$$\mathrm{PFF}(\mathbf{X}) = \mathrm{FC}(\sigma(\mathrm{FC}(\mathbf{X}))), \qquad \mathrm{FC}(\mathbf{X}) = \mathbf{X}\mathbf{W} + \mathbf{b}, \qquad (6)$$
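Putting Eqs. (4)-(6) together, here is a compact sketch of one such Transformer block with the relative positional bias $\mathbf{p}$ added to the attention logits. The inner width $4d$ and the choice of ReLU for $\sigma$ are our assumptions (the paper fixes neither), and residual connections plus layer normalization are omitted for brevity:

```python
import math
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Eqs. (4)-(6) with a per-head relative positional bias p (Eq. (3))."""

    def __init__(self, d: int, h: int):
        super().__init__()
        assert d % h == 0
        self.h, self.dk = h, d // h
        self.wq, self.wk, self.wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.wo = nn.Linear(d, d)                     # W^O in Eq. (5)
        self.pff = nn.Sequential(nn.Linear(d, 4 * d), # inner width 4d is an assumption
                                 nn.ReLU(),           # sigma is not fixed by the paper
                                 nn.Linear(4 * d, d)) # Eq. (6)

    def forward(self, x: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # x: [B, n, d]; p: [h, n, n] relative positional bias from Eq. (3)
        B, n, _ = x.shape
        def split(t):  # [B, n, d] -> [B, h, n, d/h]
            return t.view(B, n, self.h, self.dk).transpose(1, 2)
        Q, K, V = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))

        logits = Q @ K.transpose(-2, -1) / math.sqrt(self.dk) + p  # Eq. (4) + bias
        heads = torch.softmax(logits, dim=-1) @ V                  # [B, h, n, d/h]
        out = self.wo(heads.transpose(1, 2).reshape(B, n, -1))     # Eq. (5): concat + W^O
        return self.pff(out)                                       # Eq. (6)
```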