Incorporating Bias-aware Margins into Contrastive
Loss for Collaborative Filtering
An Zhang†‡  Wenchang Ma‡  Xiang Wang§  Tat-Seng Chua†‡
†Sea-NExT Joint Lab
‡National University of Singapore
§University of Science and Technology of China
anzhang@u.nus.edu,e0724290@u.nus.edu,xiangwang1223@gmail.com
dcscts@nus.edu.sg
Abstract
Collaborative filtering (CF) models easily suffer from popularity bias, which makes
recommendation deviate from users’ actual preferences. However, most current
debiasing strategies are prone to playing a trade-off game between head and tail
performance, thus inevitably degrading the overall recommendation accuracy. To
reduce the negative impact of popularity bias on CF models, we incorporate Bias-aware margins into Contrastive loss and propose a simple yet effective BC Loss,
where the margin tailors quantitatively to the bias degree of each user-item interac-
tion. We investigate the geometric interpretation of BC loss, then further visualize
and theoretically prove that it simultaneously learns better head and tail represen-
tations by encouraging the compactness of similar users/items and enlarging the
dispersion of dissimilar users/items. Over eight benchmark datasets, we use BC
loss to optimize two high-performing CF models. On various evaluation settings
(i.e., imbalanced/balanced, temporal split, fully-observed unbiased, tail/head test
evaluations), BC loss outperforms the state-of-the-art debiasing and non-debiasing
methods with remarkable improvements. Considering the theoretical guarantee
and empirical success of BC loss, we advocate using it not just as a debiasing
strategy, but also as a standard loss in recommender models. Codes are available at
https://github.com/anzhang314/BC-Loss.
1 Introduction
At the core of leading collaborative filtering (CF) models is the learning of high-quality representations
of users and items from historical interactions. However, most CF models easily suffer from the
popularity bias issue in the interaction data [1, 2, 3, 4]. Specifically, the training data distribution is
typically long-tailed, e.g., a few head items occupy most of the interactions, whereas the majority
of tail items are unpopular and receive little attention. The CF models built upon the imbalanced
data are prone to learn the popularity bias and even amplify it by over-recommending head items and
under-recommending tail items. As a result, the popularity bias causes the biased representations
with poor generalization ability, making recommendations deviate from users’ actual preferences.
Motivated by concerns of popularity bias, studies on debiasing have been conducted to lift the tail
performance. Unfortunately, most prevalent debiasing strategies focus on the trade-off between
head and tail evaluations (see Table 3), including post-processing re-ranking [5, 6, 7, 8, 9], balanced training loss [10, 11, 12, 9], sample re-weighting [13, 14, 15, 16, 17, 18], and head bias removal
by causal inference [19, 20, 21, 22]. Worse still, many of them hold assumptions that are infeasible in practice, such as that the balanced test distribution is known in advance to guide the hyperparameters' adjustment [23, 22], or that a small unbiased dataset is available to train the unbiased model [24, 19]. Consequently, they pursue improvements on tail items but exacerbate the performance sacrifice of head items, leading to a severe overall performance drop. The trade-off between the head and tail evaluations results in suboptimal representations, which derails the generalization ability.

Xiang Wang is the corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.11054v2 [cs.IR] 18 Feb 2023

(a) BPR loss (head) (b) Softmax loss (head) (c) IPS-CN [15] (head) (d) BC loss (head)
(e) BPR loss (tail) (f) Softmax loss (tail) (g) IPS-CN [15] (tail) (h) BC loss (tail)
Figure 1: Visualizations of item representations learned by LightGCN [25] on Yelp2018 [25], where subfigures (a-d)/(e-h) depict the identical head/tail user as a green star, while the red and blue points denote positive and negative items, respectively. In each subfigure, the first row presents the 3D item representations projected on the unit sphere, while the second row shows the angle distribution of items w.r.t. the specific user and the statistics of mean angles. Compared to other losses, BC loss learns better head representations (cf. with the smallest mean positive angle, the vast majority of positive items fall into the group closest to the user) and tail representations (cf. a clear margin exists between positive and negative items for the tail user). BC loss learns a more reasonable representation distribution that is locally clustered and globally separated. See more details in Appendix A.1.
In this paper, we conjecture that an ideal debiasing strategy should learn high-quality head and tail
representations with powerful discrimination and generalization abilities, rather than playing a trade-
off game between the head and tail performance. Here we follow the prior studies [15, 23, 13, 14] to focus on one key ingredient in representation learning: the loss function. Figure 1 depicts the item representations, which are optimized via two non-debiasing losses (BPR [26] and Softmax [27]) and one debiasing loss (IPS-CN [15]). Herein, representation discrimination is reflected in how well the positive items of a user are separated from the negatives. Our insights are: (1) For a user, the non-debiasing losses are inadequate to discriminate his/her positive and negative items well, since their representations largely overlap, as Figures 4a and 4b show; (2) Although IPS-CN achieves better discrimination power in the tail group than BPR (cf. positive items get smaller angles to the ego user in Figure 4g, as compared to Figure 4e), it gets worse discrimination ability in the head (cf. positive items hold larger angles to the ego user in Figure 4c, as compared to Figure 4a).
Towards this end, we incorporate Bias-aware margins into Contrastive Loss and devise a simple yet effective BC Loss to guide the head and tail representation learning of CF models. Specifically, we first employ a bias degree extractor to quantify the influence of interaction-wise popularity — that is, how well an interaction is predicted when only popularity information of the target user and item is used. Interactions involving inactive users and unpopular items often align with lower bias degrees, indicating that popularity fails to reflect user preference faithfully. In contrast, interactions with active users and popular items are spurred by the popularity information, thus easily inclining to high bias degrees. We then move on to train the CF model by converting the bias degrees into angular margins between user and item representations. If the bias degree is low, we impose a larger margin to strongly squeeze the tightness of representations. In contrast, if the bias degree is high, we exert a small or vanishing margin to reduce the influence of biased representations. In this way, for each ego user's representation, BC loss quantitatively controls its bias-aware margins with item representations — adaptively intensifying the representation similarity among positive items, while diluting that among negative items. Benefiting from stringent and discriminative representations, BC loss significantly improves both head and tail performance.
Furthermore, BC loss has three desirable advantages. First, it has a clear geometric interpretation,
as illustrated in Figure 2. Second, it brings forth a simple but effective mechanism of hard example
mining (See Appendix A.2). Third, we theoretically reveal that BC loss tends to learn a low-entropy
cluster for positive pairs (e.g., compactness of matched users and items) and a high-entropy space
for negative pairs (e.g., dispersion of unmatched users and items) (See Theorem 1). Considering the
theoretical guarantee and empirical effectiveness, we argue that BC loss is not only promising to
alleviate popularity bias, but also suitable as a standard learning strategy in CF.
2 Preliminary of Collaborative Filtering (CF)
Task Formulation. Personalized recommendation is retrieving a subset of items from a large catalog to match user preference. Here we consider a typical scenario, collaborative filtering (CF) with implicit feedback [28], which can be framed as a top-$N$ recommendation problem. Let $\mathcal{O}^{+} = \{(u, i) \mid y_{ui} = 1\}$ be the historical interactions between users $\mathcal{U}$ and items $\mathcal{I}$, where $y_{ui} = 1$ indicates that user $u \in \mathcal{U}$ has adopted item $i \in \mathcal{I}$ before. Our goal is to optimize a CF model $\hat{y}: \mathcal{U} \times \mathcal{I} \to \mathbb{R}$ that latches on user preference towards items.
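To make the notation above concrete, here is a minimal sketch of the implicit-feedback setup in plain Python. The IDs and interactions are toy data invented for illustration, not from the paper's benchmarks.

```python
# Toy illustration of the implicit-feedback notation: O+ = {(u, i) | y_ui = 1}.
items = [0, 1, 2, 3]

O_plus = {(0, 1), (0, 3), (1, 0), (2, 2)}  # user 0 adopted items 1 and 3, etc.

def y(u, i):
    """Implicit-feedback label y_ui: 1 if the interaction was observed, else 0."""
    return 1 if (u, i) in O_plus else 0

def unobserved(u):
    """Items user u never interacted with, from which negatives N_u are sampled."""
    return [i for i in items if y(u, i) == 0]

print(unobserved(0))  # [0, 2]
```

The top-$N$ task then amounts to scoring every item in `unobserved(u)` with $\hat{y}(u, i)$ and returning the $N$ highest-scored ones.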
Modeling Scheme. Scrutinizing leading CF models [26, 25, 29, 30], we systematize the common paradigm as a combination of three modules: user encoder $\psi(\cdot)$, item encoder $\phi(\cdot)$, and similarity function $s(\cdot)$. Formally, we depict one CF model as $\hat{y}(u, i) = s(\psi(u), \phi(i))$, where $\psi: \mathcal{U} \to \mathbb{R}^{d}$ and $\phi: \mathcal{I} \to \mathbb{R}^{d}$ encode the identity (ID) information of user $u$ and item $i$ into $d$-dimensional representations, respectively; $s: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ measures the similarity between user and item representations. In the literature, there are various choices of encoders and similarity functions:
Common encoders roughly fall into three groups: ID-based (e.g., MF [26, 29], NMF [31], CMN [32]), history-based (e.g., SVD++ [29], FISM [33], MultVAE [30]), and graph-based (e.g., GCMC [34], PinSage [35], LightGCN [25]) fashions. Here we select two high-performing encoders, MF and LightGCN, as the backbone models being optimized.
The widely-used similarity functions include dot product [26], cosine similarity [36], and neural networks [31]. As suggested in the recent study [36], cosine similarity is a simple yet effective and efficient similarity function in CF models, having achieved strong performance. For better interpretation, we take a geometric view and denote it by:
$$s(\psi(u), \phi(i)) = \frac{\psi(u)^{\top} \phi(i)}{\|\psi(u)\| \cdot \|\phi(i)\|} \doteq \cos(\hat{\theta}_{ui}), \quad (1)$$
in which $\hat{\theta}_{ui}$ is the angle between the user representation $\psi(u)$ and item representation $\phi(i)$.
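As a minimal sketch of Equation (1), the cosine similarity and the corresponding angle $\hat{\theta}_{ui}$ can be computed as follows; the toy vectors are illustrative assumptions, not learned representations.

```python
import math

def cosine_sim(psi_u, phi_i):
    """cos(theta_ui) of Equation (1): normalized dot product of the user
    representation psi(u) and the item representation phi(i)."""
    dot = sum(a * b for a, b in zip(psi_u, phi_i))
    norms = math.sqrt(sum(a * a for a in psi_u)) * math.sqrt(sum(b * b for b in phi_i))
    return dot / norms

def angle(psi_u, phi_i):
    """Recover the angle theta_ui itself via arccos (clamped for float safety)."""
    return math.acos(max(-1.0, min(1.0, cosine_sim(psi_u, phi_i))))

print(round(angle([1.0, 0.0], [1.0, 1.0]), 4))  # pi/4 ~ 0.7854
```

Working with the angle rather than the raw dot product is what makes the additive margin of BC loss geometrically meaningful later on.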
Learning Strategy. To optimize the model parameters, CF models mostly frame the top-$N$ recommendation problem as a supervised learning task, and resort to one of three classical learning strategies: pointwise loss (e.g., binary cross-entropy [37], mean square error [29]), pairwise loss (e.g., BPR [26], WARP [38]), and softmax loss [28]. Among them, pointwise and pairwise losses are long-standing and widely-adopted objective functions in CF. However, extensive studies [9, 1, 39] have analytically and empirically confirmed that using pointwise or pairwise loss is prone to propagate more information towards the head user-item pairs, which amplifies popularity bias.
Softmax loss is much less explored in CF than in other domains like CV [40, 41]. Recent studies [36, 42, 43, 44, 45] find that it inherently conducts hard example mining over multiple negatives and aligns well with the ranking metric, thus attracting a surge of interest in recommendation.
Hence, we cast the minimization of softmax loss [27] as the representative learning strategy:
$$\mathcal{L}_{0} = -\sum_{(u,i) \in \mathcal{O}^{+}} \log \frac{\exp(\cos(\hat{\theta}_{ui})/\tau)}{\exp(\cos(\hat{\theta}_{ui})/\tau) + \sum_{j \in \mathcal{N}_{u}} \exp(\cos(\hat{\theta}_{uj})/\tau)}, \quad (2)$$
where $(u, i) \in \mathcal{O}^{+}$ is one observed interaction of user $u$, while $\mathcal{N}_{u} = \{j \mid y_{uj} = 0\}$ is the set of sampled unobserved items that $u$ did not interact with before; $\tau$ is the hyper-parameter known as the temperature in softmax [46]. Nonetheless, modifying softmax loss to enhance the discriminative
power of representations and alleviate the popularity bias remains largely unexplored. Therefore,
our work aims to devise a more generic and broadly-applicable variant of softmax loss for CF tasks,
which can improve the long-tail performance fundamentally.
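As a sketch of Equation (2), one summand of the softmax loss can be computed in plain Python; the temperature value below is an arbitrary illustration, not a recommended setting.

```python
import math

def softmax_loss_term(cos_pos, cos_negs, tau=0.1):
    """One summand of Equation (2): negative log-probability of the positive
    item against the sampled negatives N_u, with temperature tau."""
    logits = [cos_pos / tau] + [c / tau for c in cos_negs]
    m = max(logits)  # log-sum-exp trick for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - cos_pos / tau  # equals -log softmax(positive)

# A well-separated positive incurs a smaller loss than a confusable one.
easy = softmax_loss_term(0.9, [-0.5, -0.3])
hard = softmax_loss_term(0.1, [0.05, 0.0])
print(easy < hard)  # True
```

The small-$\tau$ regime sharpens the softmax, which is one way to see the hard-example-mining behavior the cited studies describe: confusable negatives dominate the gradient.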
3 Methodology of BC Loss
On the basis of softmax loss, we devise our BC loss and present its desirable characteristics.
3.1 Popularity Bias Extractor
Before mitigating popularity bias, we need to quantify the influence of popularity bias on a single user-item pair. One straightforward solution is to compare the performance difference between the biased and unbiased evaluations. However, this is not feasible, as unbiased data is usually unavailable in practice. Statistical metrics of popularity could be a reasonable proxy of the biased information, such as the user popularity statistic $p_{u} \in \mathcal{P}$ (i.e., the number of historical items that user $u$ has interacted with before) and the item popularity statistic $p_{i} \in \mathcal{P}$ (i.e., the number of observed interactions that item $i$ is involved in). If the impact of the interaction between $u$ and $i$ can be captured well based solely on such statistics, the model is susceptible to exploiting popularity bias for prediction. Hence, we argue that the popularity-only prediction will delineate the influence of bias.

Towards this end, we first train an additional module, termed popularity bias extractor, which only takes the popularity statistics as input to make predictions. Similar to the modeling of CF (cf. Section 2), the bias extractor is formulated as a function $\hat{y}_{b}: \mathcal{P} \times \mathcal{P} \to \mathbb{R}$:
$$\hat{y}_{b}(p_{u}, p_{i}) = s(\psi_{b}(p_{u}), \phi_{b}(p_{i})) \doteq \cos(\hat{\xi}_{ui}), \quad (3)$$
where the user popularity encoder $\psi_{b}: \mathcal{P} \to \mathbb{R}^{d}$ and the item popularity encoder $\phi_{b}: \mathcal{P} \to \mathbb{R}^{d}$ map the popularity statistics of user $u$ and item $i$ into $d$-dimensional popularity embeddings $\psi_{b}(p_{u})$ and $\phi_{b}(p_{i})$, respectively; $s: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ is the cosine similarity function between popularity embeddings (cf. Equation (1)). $\hat{\xi}_{ui}$ is the angle between $\psi_{b}(p_{u})$ and $\phi_{b}(p_{i})$.
We then minimize the following softmax loss to optimize the popularity bias extractor:
$$\mathcal{L}_{b} = -\sum_{(u,i) \in \mathcal{O}^{+}} \log \frac{\exp(\cos(\hat{\xi}_{ui})/\tau)}{\exp(\cos(\hat{\xi}_{ui})/\tau) + \sum_{j \in \mathcal{N}_{u}} \exp(\cos(\hat{\xi}_{uj})/\tau)}. \quad (4)$$
This optimization enforces the extractor to reconstruct the historical interactions using only biased information (i.e., popularity statistics) and makes the reconstruction reflect the interaction-wise bias degree. As shown in Appendix B.5, interactions with active users and popular items tend to be learned well via Equation (4). Furthermore, we can distinguish hard interactions based on the bias degree, i.e., the interactions that can hardly be predicted from popularity statistics ought to be more informative for representation learning in the target CF model. In a nutshell, the popularity bias extractor underscores the bias degree of each user-item interaction, which substantively reflects how hard it is to predict.
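The popularity statistics feeding the extractor are just interaction counts, which can be sketched minimally as follows; the interactions are toy data invented for illustration.

```python
from collections import Counter

# Toy observed interactions O+ (hypothetical): (user, item) pairs.
O_plus = [(0, 1), (0, 3), (1, 1), (2, 1), (2, 2)]

# p_u: number of historical items user u interacted with;
# p_i: number of observed interactions item i is involved in.
p_u = Counter(u for u, _ in O_plus)
p_i = Counter(i for _, i in O_plus)

print(p_i[1], p_i[2])  # item 1 is comparatively "head" (3 interactions) vs item 2 (1)
```

These counts $p_{u}, p_{i}$ are the only inputs the popularity encoders $\psi_{b}, \phi_{b}$ ever see; everything downstream of Equation (3) is standard embedding learning over them.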
3.2 BC Loss
We move on to devise a new BC loss for the target CF model. Our BC loss stems from softmax loss but converts the interaction-wise bias degrees into bias-aware angular margins among the representations to enhance their discriminative power. Our BC loss is:
$$\mathcal{L}_{BC} = -\sum_{(u,i) \in \mathcal{O}^{+}} \log \frac{\exp(\cos(\hat{\theta}_{ui} + M_{ui})/\tau)}{\exp(\cos(\hat{\theta}_{ui} + M_{ui})/\tau) + \sum_{j \in \mathcal{N}_{u}} \exp(\cos(\hat{\theta}_{uj})/\tau)}, \quad (5)$$
4
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
𝑴𝒖𝒊
𝟐
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
(a) Softmax Loss (2D)
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
𝑴𝒖𝒊
𝟐
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
(b) BC Loss (2D)
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
𝑴𝒖𝒊
𝟐
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
(c) Softmax Loss (3D)
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
𝑴𝒖𝒊
𝟐
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
(d) BC Loss (3D)
Figure 2: Geometric Interpretation of softmax loss and BC loss in 2D and 3D hypersphere. The dark
red region indicates the discriminative user constraint, while the light red region is for comparison.
where $M_{ui}$ is the bias-aware angular margin for the interaction $(u, i)$, defined as:
$$M_{ui} = \min\{\hat{\xi}_{ui},\; \pi - \hat{\theta}_{ui}\}, \quad (6)$$
where $\hat{\xi}_{ui}$ is derived from the popularity bias extractor (cf. Equation (3)), and $\pi - \hat{\theta}_{ui}$ is the upper bound that restricts $\cos(\cdot + M_{ui})$ to be a monotonically decreasing function. Intuitively, if a user-item pair $(u, i)$ is a hard interaction that can hardly be reconstructed from its popularity statistics, it holds a high value of $\hat{\xi}_{ui}$ and leads to a high value of $M_{ui}$; henceforward, BC loss imposes the large angular margin $M_{ui}$ between the negative item $j$ and positive item $i$ and optimizes the representations of user $u$ and item $i$ to lower $\hat{\theta}_{ui}$. See more details and analyses in Section 4.
It is noted that BC loss is extremely easy to implement in recommendation tasks, requiring only the revision of several lines of code. Moreover, compared with softmax loss, BC loss adds only negligible computational complexity during training (cf. Table 5) but achieves more discriminative representations. Hence, we recommend using BC loss not only as a debiasing strategy to alleviate popularity bias, but also as a standard loss in recommender models to enhance discriminative power. Note that the modeling of $M_{ui}$ is worth exploring, such as the more complex version $M_{ui} = \min\{\lambda \cdot \hat{\xi}_{ui},\; \pi - \hat{\theta}_{ui}\}$, where $\lambda$ controls the strength of the bias-aware margin. Meanwhile, carefully designing a monotonically decreasing function would remove the need for the upper-bound restriction. We leave the exploration of the bias-aware margin to future work.
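To make the "several lines of code" claim concrete, here is a minimal plain-Python sketch of Equations (5)-(6) for a single interaction. The angles and temperature are illustrative assumptions; a real implementation would operate on batches of embeddings in a deep-learning framework.

```python
import math

def bc_margin(xi_ui, theta_ui):
    """Equation (6): the bias-aware margin, capped at pi - theta_ui so that
    cos(theta_ui + M_ui) remains monotonically decreasing in the margin."""
    return min(xi_ui, math.pi - theta_ui)

def bc_loss_term(theta_ui, theta_negs, xi_ui, tau=0.1):
    """One summand of BC loss (Equation (5)): softmax loss with the positive
    angle enlarged by the bias-aware margin M_ui."""
    pos = math.cos(theta_ui + bc_margin(xi_ui, theta_ui)) / tau
    logits = [pos] + [math.cos(t) / tau for t in theta_negs]
    m = max(logits)  # log-sum-exp trick for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - pos

# A hard, low-bias interaction (large xi_ui) gets a larger margin and hence a
# stricter ranking requirement than a popularity-driven one, at equal angles.
low_bias = bc_loss_term(0.5, [1.5, 2.0], xi_ui=0.8)
high_bias = bc_loss_term(0.5, [1.5, 2.0], xi_ui=0.1)
print(low_bias > high_bias)  # True
```

Relative to the softmax loss of Equation (2), the only change is adding `bc_margin(...)` to the positive angle, which is where the "revise several lines of code" observation comes from.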
4 Analyses of BC Loss
We analyze desirable characteristics of BC loss. Specifically, we start by presenting its geometric inter-
pretation, and then show its theoretical properties w.r.t. compactness and dispersion of representations.
The hard mining mechanism of BC loss is discussed in Appendix A.2.
4.1 Geometric Interpretation
Here we probe into the ranking criteria of softmax loss and BC loss, from the geometric perspective.
To simplify the geometric interpretation, we analyze one user $u$ with one observed item $i$ and only two unobserved items $j$ and $k$. Then the posterior probability of the positive item obtained by softmax loss is $\frac{\exp(\cos(\hat{\theta}_{ui})/\tau)}{\exp(\cos(\hat{\theta}_{ui})/\tau) + \exp(\cos(\hat{\theta}_{uj})/\tau) + \exp(\cos(\hat{\theta}_{uk})/\tau)}$. During training, softmax loss encourages the ranking criteria $\hat{\theta}_{ui} < \hat{\theta}_{uj}$ and $\hat{\theta}_{ui} < \hat{\theta}_{uk}$ to model the basic assumption that the observed interaction $(u, i)$ indicates more positive cues of user preference than the unobserved interactions $(u, j)$ and $(u, k)$.

Intuitively, to make the ranking criteria more stringent, we can impose an angular margin $M_{ui}$ and establish the new criteria $\hat{\theta}_{ui} + M_{ui} < \hat{\theta}_{uj}$ and $\hat{\theta}_{ui} + M_{ui} < \hat{\theta}_{uk}$. Directly formulating this idea arrives at the posterior probability of BC loss: $\frac{\exp(\cos(\hat{\theta}_{ui} + M_{ui})/\tau)}{\exp(\cos(\hat{\theta}_{ui} + M_{ui})/\tau) + \exp(\cos(\hat{\theta}_{uj})/\tau) + \exp(\cos(\hat{\theta}_{uk})/\tau)}$. Obviously, BC loss is more rigorous about the ranking assumption compared with softmax loss. See Appendix A.2 for more detailed explanations.
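The stricter criterion can be checked numerically: with identical angles, adding the margin shrinks the positive item's posterior, so the model must pull $\hat{\theta}_{ui}$ further down to recover the same probability. The angles and temperature below are toy values chosen for illustration.

```python
import math

def posterior_positive(theta_pos, theta_negs, margin=0.0, tau=0.1):
    """Posterior probability of the positive item for user u, with an optional
    angular margin on the positive angle (margin=0 recovers softmax loss)."""
    num = math.exp(math.cos(theta_pos + margin) / tau)
    den = num + sum(math.exp(math.cos(t) / tau) for t in theta_negs)
    return num / den

p_softmax = posterior_positive(0.6, [1.0, 1.2])
p_bc = posterior_positive(0.6, [1.0, 1.2], margin=0.3)
print(p_bc < p_softmax)  # True: the same ranking "counts" less under the margin
```

Since the margin only deflates the positive's numerator while the negatives' terms stay fixed, the inequality holds for any positive margin on $[0, \pi - \theta_{pos}]$.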
We then depict the geometric interpretation and comparison of softmax loss and BC loss in Figure 2. Assume the learned representations of $i$, $j$, and $k$ are given, and softmax and BC losses are optimized to the same value. In softmax loss, the constraint boundaries for correctly ranking user $u$'s preference