Incorporating Bias-aware Margins into Contrastive
Loss for Collaborative Filtering
An Zhang†‡  Wenchang Ma‡  Xiang Wang§  Tat-Seng Chua†‡
†Sea-NExT Joint Lab
‡National University of Singapore
§University of Science and Technology of China
anzhang@u.nus.edu,e0724290@u.nus.edu,xiangwang1223@gmail.com
dcscts@nus.edu.sg
Abstract
Collaborative filtering (CF) models easily suffer from popularity bias, which makes
recommendation deviate from users’ actual preferences. However, most current
debiasing strategies are prone to playing a trade-off game between head and tail
performance, thus inevitably degrading the overall recommendation accuracy. To
reduce the negative impact of popularity bias on CF models, we incorporate Bias-aware margins into Contrastive loss and propose a simple yet effective BC Loss,
where the margin tailors quantitatively to the bias degree of each user-item interac-
tion. We investigate the geometric interpretation of BC loss, then further visualize
and theoretically prove that it simultaneously learns better head and tail represen-
tations by encouraging the compactness of similar users/items and enlarging the
dispersion of dissimilar users/items. Over eight benchmark datasets, we use BC
loss to optimize two high-performing CF models. On various evaluation settings
(i.e., imbalanced/balanced, temporal split, fully-observed unbiased, tail/head test
evaluations), BC loss outperforms the state-of-the-art debiasing and non-debiasing
methods with remarkable improvements. Considering the theoretical guarantee
and empirical success of BC loss, we advocate using it not just as a debiasing
strategy, but also as a standard loss in recommender models. Codes are available at
https://github.com/anzhang314/BC-Loss.
1 Introduction
At the core of leading collaborative filtering (CF) models is the learning of high-quality representations
of users and items from historical interactions. However, most CF models easily suffer from the
popularity bias issue in the interaction data [1, 2, 3, 4]. Specifically, the training data distribution is
typically long-tailed, e.g., a few head items occupy most of the interactions, whereas the majority
of tail items are unpopular and receive little attention. The CF models built upon the imbalanced
data are prone to learn the popularity bias and even amplify it by over-recommending head items and
under-recommending tail items. As a result, the popularity bias causes the biased representations
with poor generalization ability, making recommendations deviate from users’ actual preferences.
Motivated by concerns of popularity bias, studies on debiasing have been conducted to lift the tail
performance. Unfortunately, most prevalent debiasing strategies focus on the trade-off between
head and tail evaluations (see Table 3), including post-processing re-ranking [5, 6, 7, 8, 9], balanced training loss [10, 11, 12, 9], sample re-weighting [13, 14, 15, 16, 17, 18], and head bias removal
by causal inference [19, 20, 21, 22]. Worse still, many of them hold assumptions that are infeasible in practice, such as that the balanced test distribution is known in advance to guide the hyperparameters' adjustment [23, 22], or that a small unbiased dataset is available to train the unbiased model [24, 19]. Consequently, they pursue improvements on tail items but exacerbate the performance sacrifice of head items, leading to a severe overall performance drop. The trade-off between the head and tail evaluations results in suboptimal representations, which derails the generalization ability.

Xiang Wang is the corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.11054v2 [cs.IR] 18 Feb 2023

(a) BPR loss (head) (b) Softmax loss (head) (c) IPS-CN [15] (head) (d) BC loss (head)
(e) BPR loss (tail) (f) Softmax loss (tail) (g) IPS-CN [15] (tail) (h) BC loss (tail)
Figure 1: Visualizations of item representations learned by LightGCN [25] on Yelp2018 [25], where subfigures (a-d)/(e-h) depict the identical head/tail user as a green star, while the red and blue points denote positive and negative items, respectively. In each subfigure, the first row presents the 3D item representations projected on the unit sphere, while the second row shows the angle distribution of items w.r.t. the specific user and the statistics of mean angles. Compared to other losses, BC loss learns better head representations (cf. with the smallest mean positive angle, the vast majority of positive items fall into the group closest to the user) and tail representations (cf. a clear margin exists between positive and negative items for the tail user). BC loss learns a more reasonable representation distribution that is locally clustered and globally separated. See more details in Appendix A.1.
In this paper, we conjecture that an ideal debiasing strategy should learn high-quality head and tail
representations with powerful discrimination and generalization abilities, rather than playing a trade-
off game between the head and tail performance. Here we follow the prior studies [15, 23, 13, 14] to focus on one key ingredient in representation learning: the loss function. Figure 1 depicts the item representations, which are optimized via two non-debiasing losses (BPR [26] and Softmax [27]) and one debiasing loss (IPS-CN [15]). Herein, representation discrimination is reflected in how well the positive items of a user are separated from the negatives. Our insights are: (1) For a user, the non-debiasing losses are inadequate to discriminate his/her positive and negative items well, since their representations largely overlap, as Figures 4a and 4b show; (2) Although IPS-CN achieves better discrimination power in the tail group than BPR (cf. positive items get smaller angles to the ego user in Figure 4g, as compared to Figure 4e), it gets worse discrimination ability in the head (cf. positive items hold larger angles to the ego user in Figure 4c, as compared to Figure 4a).
Towards this end, we incorporate Bias-aware margins into Contrastive Loss and devise a simple yet effective BC Loss to guide the head and tail representation learning of CF models. Specifically, we first employ a bias degree extractor to quantify the influence of interaction-wise popularity — that is, how well an interaction is predicted when only popularity information of the target user and item is used. Interactions involving inactive users and unpopular items often align with lower bias degrees, indicating that popularity fails to reflect user preference faithfully. In contrast, interactions with active users and popular items are spurred by the popularity information, thus easily inclining to high bias degrees. We then move on to train the CF model by converting the bias degrees into angular margins between user and item representations. If the bias degree is low, we impose a larger margin to strongly squeeze the tightness of representations. In contrast, if the bias degree is high, we exert a small or vanishing margin to reduce the influence of biased representations. In this way, for each ego user's representation, BC loss quantitatively controls its bias-aware margins with item representations — adaptively intensifying the representation similarity among positive items, while diluting that among negative items. Benefiting from stringent and discriminative representations, BC loss significantly improves both head and tail performance.
Furthermore, BC loss has three desirable advantages. First, it has a clear geometric interpretation,
as illustrated in Figure 2. Second, it brings forth a simple but effective mechanism of hard example
mining (See Appendix A.2). Third, we theoretically reveal that BC loss tends to learn a low-entropy
cluster for positive pairs (e.g., compactness of matched users and items) and a high-entropy space
for negative pairs (e.g., dispersion of unmatched users and items) (See Theorem 1). Considering the
theoretical guarantee and empirical effectiveness, we argue that BC loss is not only promising to
alleviate popularity bias, but also suitable as a standard learning strategy in CF.
2 Preliminary of Collaborative Filtering (CF)
Task Formulation. Personalized recommendation is retrieving a subset of items from a large catalog to match user preference. Here we consider a typical scenario, collaborative filtering (CF) with implicit feedback [28], which can be framed as a top-$N$ recommendation problem. Let $\mathcal{O}^{+} = \{(u, i) \mid y_{ui} = 1\}$ be the historical interactions between users $\mathcal{U}$ and items $\mathcal{I}$, where $y_{ui} = 1$ indicates that user $u \in \mathcal{U}$ has adopted item $i \in \mathcal{I}$ before. Our goal is to optimize a CF model $\hat{y}: \mathcal{U} \times \mathcal{I} \to \mathbb{R}$ that latches on user preference towards items.
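To make the notation above concrete, here is a minimal sketch of the implicit-feedback setup in plain Python. The IDs and interactions are toy data invented for illustration, not from the paper's benchmarks.

```python
# Toy illustration of the implicit-feedback notation: O+ = {(u, i) | y_ui = 1}.
items = [0, 1, 2, 3]

O_plus = {(0, 1), (0, 3), (1, 0), (2, 2)}  # user 0 adopted items 1 and 3, etc.

def y(u, i):
    """Implicit-feedback label y_ui: 1 if the interaction was observed, else 0."""
    return 1 if (u, i) in O_plus else 0

def unobserved(u):
    """Items user u never interacted with, from which negatives N_u are sampled."""
    return [i for i in items if y(u, i) == 0]

print(unobserved(0))  # [0, 2]
```

The top-$N$ task then amounts to scoring every item in `unobserved(u)` with $\hat{y}(u, i)$ and returning the $N$ highest-scored ones.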
Modeling Scheme. Scrutinizing leading CF models [26, 25, 29, 30], we systematize the common paradigm as a combination of three modules: user encoder $\psi(\cdot)$, item encoder $\phi(\cdot)$, and similarity function $s(\cdot)$. Formally, we depict one CF model as $\hat{y}(u, i) = s(\psi(u), \phi(i))$, where $\psi: \mathcal{U} \to \mathbb{R}^{d}$ and $\phi: \mathcal{I} \to \mathbb{R}^{d}$ encode the identity (ID) information of user $u$ and item $i$ into $d$-dimensional representations, respectively; $s: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ measures the similarity between user and item representations. In the literature, there are various choices of encoders and similarity functions:
Common encoders roughly fall into three groups: ID-based (e.g., MF [26, 29], NMF [31], CMN [32]), history-based (e.g., SVD++ [29], FISM [33], MultVAE [30]), and graph-based (e.g., GCMC [34], PinSage [35], LightGCN [25]) fashions. Here we select two high-performing encoders, MF and LightGCN, as the backbone models being optimized.
The widely-used similarity functions include dot product [26], cosine similarity [36], and neural networks [31]. As suggested in the recent study [36], cosine similarity is a simple yet effective and efficient similarity function in CF models, having achieved strong performance. For better interpretation, we take a geometric view and denote it by:
$$s(\psi(u), \phi(i)) = \frac{\psi(u)^{\top} \phi(i)}{\|\psi(u)\| \cdot \|\phi(i)\|} \doteq \cos(\hat{\theta}_{ui}), \quad (1)$$
in which $\hat{\theta}_{ui}$ is the angle between the user representation $\psi(u)$ and item representation $\phi(i)$.
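As a minimal sketch of Equation (1), the cosine similarity and the corresponding angle $\hat{\theta}_{ui}$ can be computed as follows; the toy vectors are illustrative assumptions, not learned representations.

```python
import math

def cosine_sim(psi_u, phi_i):
    """cos(theta_ui) of Equation (1): normalized dot product of the user
    representation psi(u) and the item representation phi(i)."""
    dot = sum(a * b for a, b in zip(psi_u, phi_i))
    norms = math.sqrt(sum(a * a for a in psi_u)) * math.sqrt(sum(b * b for b in phi_i))
    return dot / norms

def angle(psi_u, phi_i):
    """Recover the angle theta_ui itself via arccos (clamped for float safety)."""
    return math.acos(max(-1.0, min(1.0, cosine_sim(psi_u, phi_i))))

print(round(angle([1.0, 0.0], [1.0, 1.0]), 4))  # pi/4 ~ 0.7854
```

Working with the angle rather than the raw dot product is what makes the additive margin of BC loss geometrically meaningful later on.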
Learning Strategy. To optimize the model parameters, CF models mostly frame the top-$N$ recommendation problem as a supervised learning task, and resort to one of three classical learning strategies: pointwise loss (e.g., binary cross-entropy [37], mean square error [29]), pairwise loss (e.g., BPR [26], WARP [38]), and softmax loss [28]. Among them, pointwise and pairwise losses are long-standing and widely-adopted objective functions in CF. However, extensive studies [9, 1, 39] have analytically and empirically confirmed that using pointwise or pairwise loss is prone to propagate more information towards the head user-item pairs, which amplifies popularity bias.
Softmax loss is much less explored in CF than in other domains like CV [40, 41]. Recent studies [36, 42, 43, 44, 45] find that it inherently conducts hard example mining over multiple negatives and aligns well with the ranking metric, thus attracting a surge of interest in recommendation.
Hence, we cast the minimization of softmax loss [27] as the representative learning strategy:
$$\mathcal{L}_{0} = -\sum_{(u,i) \in \mathcal{O}^{+}} \log \frac{\exp(\cos(\hat{\theta}_{ui})/\tau)}{\exp(\cos(\hat{\theta}_{ui})/\tau) + \sum_{j \in \mathcal{N}_{u}} \exp(\cos(\hat{\theta}_{uj})/\tau)}, \quad (2)$$
where $(u, i) \in \mathcal{O}^{+}$ is one observed interaction of user $u$, while $\mathcal{N}_{u} = \{j \mid y_{uj} = 0\}$ is the set of sampled unobserved items that $u$ did not interact with before; $\tau$ is the hyper-parameter known as the temperature in softmax [46]. Nonetheless, modifying softmax loss to enhance the discriminative
power of representations and alleviate the popularity bias remains largely unexplored. Therefore,
our work aims to devise a more generic and broadly-applicable variant of softmax loss for CF tasks,
which can improve the long-tail performance fundamentally.
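As a sketch of Equation (2), one summand of the softmax loss can be computed in plain Python; the temperature value below is an arbitrary illustration, not a recommended setting.

```python
import math

def softmax_loss_term(cos_pos, cos_negs, tau=0.1):
    """One summand of Equation (2): negative log-probability of the positive
    item against the sampled negatives N_u, with temperature tau."""
    logits = [cos_pos / tau] + [c / tau for c in cos_negs]
    m = max(logits)  # log-sum-exp trick for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - cos_pos / tau  # equals -log softmax(positive)

# A well-separated positive incurs a smaller loss than a confusable one.
easy = softmax_loss_term(0.9, [-0.5, -0.3])
hard = softmax_loss_term(0.1, [0.05, 0.0])
print(easy < hard)  # True
```

The small-$\tau$ regime sharpens the softmax, which is one way to see the hard-example-mining behavior the cited studies describe: confusable negatives dominate the gradient.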
3 Methodology of BC Loss
On the basis of softmax loss, we devise our BC loss and present its desirable characteristics.
3.1 Popularity Bias Extractor
Before mitigating popularity bias, we need to quantify the influence of popularity bias on a single user-item pair. One straightforward solution is to compare the performance difference between the biased and unbiased evaluations. However, this is not feasible, as unbiased data is usually unavailable in practice. Statistical metrics of popularity could be a reasonable proxy of the biased information, such as the user popularity statistic $p_{u} \in \mathcal{P}$ (i.e., the number of historical items that user $u$ has interacted with before) and the item popularity statistic $p_{i} \in \mathcal{P}$ (i.e., the number of observed interactions that item $i$ is involved in). If the impact of the interaction between $u$ and $i$ can be captured well based solely on such statistics, the model is susceptible to exploiting popularity bias for prediction. Hence, we argue that the popularity-only prediction will delineate the influence of bias.

Towards this end, we first train an additional module, termed popularity bias extractor, which only takes the popularity statistics as input to make predictions. Similar to the modeling of CF (cf. Section 2), the bias extractor is formulated as a function $\hat{y}_{b}: \mathcal{P} \times \mathcal{P} \to \mathbb{R}$:
$$\hat{y}_{b}(p_{u}, p_{i}) = s(\psi_{b}(p_{u}), \phi_{b}(p_{i})) \doteq \cos(\hat{\xi}_{ui}), \quad (3)$$
where the user popularity encoder $\psi_{b}: \mathcal{P} \to \mathbb{R}^{d}$ and the item popularity encoder $\phi_{b}: \mathcal{P} \to \mathbb{R}^{d}$ map the popularity statistics of user $u$ and item $i$ into $d$-dimensional popularity embeddings $\psi_{b}(p_{u})$ and $\phi_{b}(p_{i})$, respectively; $s: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ is the cosine similarity function between popularity embeddings (cf. Equation (1)). $\hat{\xi}_{ui}$ is the angle between $\psi_{b}(p_{u})$ and $\phi_{b}(p_{i})$.
We then minimize the following softmax loss to optimize the popularity bias extractor:
$$\mathcal{L}_{b} = -\sum_{(u,i) \in \mathcal{O}^{+}} \log \frac{\exp(\cos(\hat{\xi}_{ui})/\tau)}{\exp(\cos(\hat{\xi}_{ui})/\tau) + \sum_{j \in \mathcal{N}_{u}} \exp(\cos(\hat{\xi}_{uj})/\tau)}. \quad (4)$$
This optimization enforces the extractor to reconstruct the historical interactions using only biased information (i.e., popularity statistics) and makes the reconstruction reflect the interaction-wise bias degree. As shown in Appendix B.5, interactions with active users and popular items tend to be learned well via Equation (4). Furthermore, we can distinguish hard interactions based on the bias degree, i.e., the interactions that can hardly be predicted from popularity statistics ought to be more informative for representation learning in the target CF model. In a nutshell, the popularity bias extractor underscores the bias degree of each user-item interaction, which substantively reflects how hard it is to predict.
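The popularity statistics feeding the extractor are just interaction counts, which can be sketched minimally as follows; the interactions are toy data invented for illustration.

```python
from collections import Counter

# Toy observed interactions O+ (hypothetical): (user, item) pairs.
O_plus = [(0, 1), (0, 3), (1, 1), (2, 1), (2, 2)]

# p_u: number of historical items user u interacted with;
# p_i: number of observed interactions item i is involved in.
p_u = Counter(u for u, _ in O_plus)
p_i = Counter(i for _, i in O_plus)

print(p_i[1], p_i[2])  # item 1 is comparatively "head" (3 interactions) vs item 2 (1)
```

These counts $p_{u}, p_{i}$ are the only inputs the popularity encoders $\psi_{b}, \phi_{b}$ ever see; everything downstream of Equation (3) is standard embedding learning over them.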
3.2 BC Loss
We move on to devise a new BC loss for the target CF model. Our BC loss stems from softmax loss but converts the interaction-wise bias degrees into bias-aware angular margins among the representations to enhance their discriminative power. Our BC loss is:
$$\mathcal{L}_{BC} = -\sum_{(u,i) \in \mathcal{O}^{+}} \log \frac{\exp(\cos(\hat{\theta}_{ui} + M_{ui})/\tau)}{\exp(\cos(\hat{\theta}_{ui} + M_{ui})/\tau) + \sum_{j \in \mathcal{N}_{u}} \exp(\cos(\hat{\theta}_{uj})/\tau)}, \quad (5)$$
4
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
𝑴𝒖𝒊
𝟐
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
(a) Softmax Loss (2D)
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
𝑴𝒖𝒊
𝟐
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
(b) BC Loss (2D)
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
𝑴𝒖𝒊
𝟐
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
(c) Softmax Loss (3D)
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
𝑴𝒖𝒊
𝟐
𝒌
𝒋
𝑜
𝒊
𝒌
𝒋
𝑜
𝒊
𝑴𝒖𝒊
𝟐
(d) BC Loss (3D)
Figure 2: Geometric Interpretation of softmax loss and BC loss in 2D and 3D hypersphere. The dark
red region indicates the discriminative user constraint, while the light red region is for comparison.
where $M_{ui}$ is the bias-aware angular margin for the interaction $(u, i)$, defined as:
$$M_{ui} = \min\{\hat{\xi}_{ui},\; \pi - \hat{\theta}_{ui}\}, \quad (6)$$
where $\hat{\xi}_{ui}$ is derived from the popularity bias extractor (cf. Equation (3)), and $\pi - \hat{\theta}_{ui}$ is the upper bound that restricts $\cos(\cdot + M_{ui})$ to be a monotonically decreasing function. Intuitively, if a user-item pair $(u, i)$ is a hard interaction that can hardly be reconstructed from its popularity statistics, it holds a high value of $\hat{\xi}_{ui}$ and leads to a high value of $M_{ui}$; henceforward, BC loss imposes the large angular margin $M_{ui}$ between the negative item $j$ and positive item $i$ and optimizes the representations of user $u$ and item $i$ to lower $\hat{\theta}_{ui}$. See more details and analyses in Section 4.
It is noted that BC loss is extremely easy to implement in recommendation tasks, requiring only the revision of several lines of code. Moreover, compared with softmax loss, BC loss adds only negligible computational complexity during training (cf. Table 5) but achieves more discriminative representations. Hence, we recommend using BC loss not only as a debiasing strategy to alleviate popularity bias, but also as a standard loss in recommender models to enhance discriminative power. Note that the modeling of $M_{ui}$ is worth exploring, such as the more complex version $M_{ui} = \min\{\lambda \cdot \hat{\xi}_{ui},\; \pi - \hat{\theta}_{ui}\}$, where $\lambda$ controls the strength of the bias-aware margin. Meanwhile, carefully designing a monotonically decreasing function would remove the need for the upper-bound restriction. We leave the exploration of the bias-aware margin to future work.
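To make the "several lines of code" claim concrete, here is a minimal plain-Python sketch of Equations (5)-(6) for a single interaction. The angles and temperature are illustrative assumptions; a real implementation would operate on batches of embeddings in a deep-learning framework.

```python
import math

def bc_margin(xi_ui, theta_ui):
    """Equation (6): the bias-aware margin, capped at pi - theta_ui so that
    cos(theta_ui + M_ui) remains monotonically decreasing in the margin."""
    return min(xi_ui, math.pi - theta_ui)

def bc_loss_term(theta_ui, theta_negs, xi_ui, tau=0.1):
    """One summand of BC loss (Equation (5)): softmax loss with the positive
    angle enlarged by the bias-aware margin M_ui."""
    pos = math.cos(theta_ui + bc_margin(xi_ui, theta_ui)) / tau
    logits = [pos] + [math.cos(t) / tau for t in theta_negs]
    m = max(logits)  # log-sum-exp trick for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - pos

# A hard, low-bias interaction (large xi_ui) gets a larger margin and hence a
# stricter ranking requirement than a popularity-driven one, at equal angles.
low_bias = bc_loss_term(0.5, [1.5, 2.0], xi_ui=0.8)
high_bias = bc_loss_term(0.5, [1.5, 2.0], xi_ui=0.1)
print(low_bias > high_bias)  # True
```

Relative to the softmax loss of Equation (2), the only change is adding `bc_margin(...)` to the positive angle, which is where the "revise several lines of code" observation comes from.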
4 Analyses of BC Loss
We analyze desirable characteristics of BC loss. Specifically, we start by presenting its geometric inter-
pretation, and then show its theoretical properties w.r.t. compactness and dispersion of representations.
The hard mining mechanism of BC loss is discussed in Appendix A.2.
4.1 Geometric Interpretation
Here we probe into the ranking criteria of softmax loss and BC loss, from the geometric perspective.
To simplify the geometric interpretation, we analyze one user $u$ with one observed item $i$ and only two unobserved items $j$ and $k$. Then the posterior probability of the positive item obtained by softmax loss is $\frac{\exp(\cos(\hat{\theta}_{ui})/\tau)}{\exp(\cos(\hat{\theta}_{ui})/\tau) + \exp(\cos(\hat{\theta}_{uj})/\tau) + \exp(\cos(\hat{\theta}_{uk})/\tau)}$. During training, softmax loss encourages the ranking criteria $\hat{\theta}_{ui} < \hat{\theta}_{uj}$ and $\hat{\theta}_{ui} < \hat{\theta}_{uk}$ to model the basic assumption that the observed interaction $(u, i)$ indicates more positive cues of user preference than the unobserved interactions $(u, j)$ and $(u, k)$.

Intuitively, to make the ranking criteria more stringent, we can impose an angular margin $M_{ui}$ and establish the new criteria $\hat{\theta}_{ui} + M_{ui} < \hat{\theta}_{uj}$ and $\hat{\theta}_{ui} + M_{ui} < \hat{\theta}_{uk}$. Directly formulating this idea arrives at the posterior probability of BC loss: $\frac{\exp(\cos(\hat{\theta}_{ui} + M_{ui})/\tau)}{\exp(\cos(\hat{\theta}_{ui} + M_{ui})/\tau) + \exp(\cos(\hat{\theta}_{uj})/\tau) + \exp(\cos(\hat{\theta}_{uk})/\tau)}$. Obviously, BC loss is more rigorous about the ranking assumption compared with softmax loss. See Appendix A.2 for more detailed explanations.
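The stricter criterion can be checked numerically: with identical angles, adding the margin shrinks the positive item's posterior, so the model must pull $\hat{\theta}_{ui}$ further down to recover the same probability. The angles and temperature below are toy values chosen for illustration.

```python
import math

def posterior_positive(theta_pos, theta_negs, margin=0.0, tau=0.1):
    """Posterior probability of the positive item for user u, with an optional
    angular margin on the positive angle (margin=0 recovers softmax loss)."""
    num = math.exp(math.cos(theta_pos + margin) / tau)
    den = num + sum(math.exp(math.cos(t) / tau) for t in theta_negs)
    return num / den

p_softmax = posterior_positive(0.6, [1.0, 1.2])
p_bc = posterior_positive(0.6, [1.0, 1.2], margin=0.3)
print(p_bc < p_softmax)  # True: the same ranking "counts" less under the margin
```

Since the margin only deflates the positive's numerator while the negatives' terms stay fixed, the inequality holds for any positive margin on $[0, \pi - \theta_{pos}]$.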
We then depict the geometric interpretation and comparison of softmax loss and BC loss in Figure 2. Assume the learned representations of $i$, $j$, and $k$ are given, and softmax and BC losses are optimized to the same value. In softmax loss, the constraint boundaries for correctly ranking user $u$'s preference