
high bias degrees. We then train the CF model by converting the bias degrees into angular margins between user and item representations. If the bias degree is low, we impose a larger margin to strongly tighten the representations; in contrast, if the bias degree is large, we exert a small or vanishing margin to reduce the influence of biased representations. In this way, for each ego user's representation, BC loss quantitatively controls its bias-aware margins with item representations, adaptively intensifying the representation similarity with positive items while diluting that with negative items. Benefiting from such compact and discriminative representations, BC loss significantly improves both head and tail performance.
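To make this margin mechanism concrete, the sketch below shows one way an interaction-wise bias degree could be mapped to an angular margin on the positive score. The linear schedule `max_margin * (1 - bias_degree)` and the function name are illustrative assumptions only, not the exact BC loss formulation.

```python
import torch

def margin_adjusted_positive_score(theta_ui: torch.Tensor,
                                   bias_degree: torch.Tensor,
                                   max_margin: float = 0.5) -> torch.Tensor:
    """Shrink the cosine score of a positive user-item pair by an angular margin
    that grows as the estimated bias degree gets smaller.

    theta_ui:     angles between user and positive-item representations (radians)
    bias_degree:  per-interaction bias estimate in [0, 1]; 1 = highly popularity-driven
    """
    # Low bias degree -> large margin (stricter tightness requirement);
    # high bias degree -> small or vanishing margin. The linear schedule is an
    # illustrative assumption, not the paper's definition.
    margin = max_margin * (1.0 - bias_degree)
    return torch.cos(theta_ui + margin)
```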
Furthermore, BC loss has three desirable advantages. First, it has a clear geometric interpretation, as illustrated in Figure 2. Second, it brings forth a simple but effective mechanism of hard example mining (see Appendix A.2). Third, we theoretically reveal that BC loss tends to learn a low-entropy cluster for positive pairs (i.e., compactness of matched users and items) and a high-entropy space for negative pairs (i.e., dispersion of unmatched users and items) (see Theorem 1). Considering the theoretical guarantee and empirical effectiveness, we argue that BC loss is not only promising for alleviating popularity bias, but also suitable as a standard learning strategy in CF.
2 Preliminary of Collaborative Filtering (CF)
Task Formulation. Personalized recommendation is retrieving a subset of items from a large catalog to match user preference. Here we consider a typical scenario, collaborative filtering (CF) with implicit feedback [28], which can be framed as a top-$N$ recommendation problem. Let $\mathcal{O}^{+} = \{(u, i) \mid y_{ui} = 1\}$ be the historical interactions between users $\mathcal{U}$ and items $\mathcal{I}$, where $y_{ui} = 1$ indicates that user $u \in \mathcal{U}$ has adopted item $i \in \mathcal{I}$ before. Our goal is to optimize a CF model $\hat{y}: \mathcal{U} \times \mathcal{I} \rightarrow \mathbb{R}$ that captures user preference towards items.
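For concreteness, implicit feedback under this formulation can be stored as the set of observed (user, item) pairs; the toy interactions and counts below are hypothetical.

```python
# O+ = {(u, i) | y_ui = 1}: observed implicit-feedback interactions (toy, hypothetical data).
positive_interactions = {(0, 3), (0, 7), (1, 2), (2, 3), (2, 5)}
num_users, num_items = 3, 8

def y(u: int, i: int) -> int:
    """Implicit feedback label: 1 if user u has adopted item i, 0 otherwise (unobserved)."""
    return int((u, i) in positive_interactions)

# Top-N recommendation then amounts to ranking each user's unobserved items
# by the learned score y_hat(u, i) and returning the N highest-scoring ones.
```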
Modeling Scheme. Scrutinizing leading CF models [26, 25, 29, 30], we systematize the common paradigm as a combination of three modules: user encoder $\psi(\cdot)$, item encoder $\phi(\cdot)$, and similarity function $s(\cdot)$. Formally, we depict one CF model as $\hat{y}(u, i) = s(\psi(u), \phi(i))$, where $\psi: \mathcal{U} \rightarrow \mathbb{R}^{d}$ and $\phi: \mathcal{I} \rightarrow \mathbb{R}^{d}$ encode the identity (ID) information of user $u$ and item $i$ into $d$-dimensional representations, respectively, and $s: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}$ measures the similarity between user and item representations. In the literature, there are various choices of encoders and similarity functions:
• Common encoders roughly fall into three groups: ID-based (e.g., MF [26, 29], NMF [31], CMN [32]), history-based (e.g., SVD++ [29], FISM [33], MultVAE [30]), and graph-based (e.g., GCMC [34], PinSage [35], LightGCN [25]) fashions. Here we select two high-performing encoders, MF and LightGCN, as the backbone models being optimized.
• The widely-used similarity functions include dot product [26], cosine similarity [36], and neural networks [31]. As suggested in a recent study [36], cosine similarity is a simple yet effective and efficient similarity function for CF models, achieving strong performance. For better interpretation, we take a geometric view and denote it by:
$$s(\psi(u), \phi(i)) = \frac{\psi(u)^{\top}\phi(i)}{\|\psi(u)\| \cdot \|\phi(i)\|} \doteq \cos(\hat{\theta}_{ui}), \qquad (1)$$
where $\hat{\theta}_{ui}$ is the angle between the user representation $\psi(u)$ and the item representation $\phi(i)$. A minimal instantiation of this modeling scheme is sketched below.
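To make the three-module paradigm and Eq. (1) concrete, here is a minimal MF-style instantiation in PyTorch: two ID-embedding tables play the roles of $\psi(\cdot)$ and $\phi(\cdot)$, and cosine similarity implements $s(\cdot)$. The class name, embedding dimension, and initialization are placeholder choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFCosineCF(nn.Module):
    """Minimal CF model: y_hat(u, i) = cos(theta_ui) between ID embeddings."""

    def __init__(self, num_users: int, num_items: int, dim: int = 64):
        super().__init__()
        self.user_enc = nn.Embedding(num_users, dim)  # psi: U -> R^d
        self.item_enc = nn.Embedding(num_items, dim)  # phi: I -> R^d
        nn.init.normal_(self.user_enc.weight, std=0.1)
        nn.init.normal_(self.item_enc.weight, std=0.1)

    def forward(self, users: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        # Eq. (1): cosine of the angle between the two d-dimensional representations.
        u = F.normalize(self.user_enc(users), dim=-1)
        v = F.normalize(self.item_enc(items), dim=-1)
        return (u * v).sum(dim=-1)  # cos(theta_ui)
```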
Learning Strategy. To optimize the model parameters, CF models mostly frame the top-$N$ recommendation problem as a supervised learning task and resort to one of three classical learning strategies: pointwise loss (e.g., binary cross-entropy [37], mean square error [29]), pairwise loss (e.g., BPR [26], WARP [38]), and softmax loss [28]. Among them, pointwise and pairwise losses are long-standing and widely-adopted objective functions in CF. However, extensive studies [9, 1, 39] have analytically and empirically confirmed that using pointwise or pairwise loss is prone to propagating more information towards the head user-item pairs, which amplifies popularity bias.
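For reference, the widely-used BPR pairwise loss over (user, observed item, sampled unobserved item) triples can be sketched as follows; it assumes a scoring model such as the MFCosineCF sketch above and treats negative sampling as given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bpr_loss(model: nn.Module,
             users: torch.Tensor,
             pos_items: torch.Tensor,
             neg_items: torch.Tensor) -> torch.Tensor:
    """BPR pairwise loss: push y_hat(u, i+) above y_hat(u, i-) for sampled negatives."""
    pos_scores = model(users, pos_items)   # y_hat(u, i+)
    neg_scores = model(users, neg_items)   # y_hat(u, i-)
    # -log sigmoid(y_hat(u, i+) - y_hat(u, i-)), averaged over the batch.
    return -F.logsigmoid(pos_scores - neg_scores).mean()
```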
Softmax loss is much less explored in CF than in other domains such as CV [40, 41]. Recent studies [36, 42, 43, 44, 45] find that it inherently conducts hard example mining over multiple negatives and aligns well with the ranking metric, thus attracting a surge of interest in recommendation.
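A common sampled-softmax form of this loss, with cosine scores divided by a temperature $\tau$, is sketched below; the temperature value and the assumption of $K$ sampled negatives per positive are illustrative. Harder negatives (higher-scoring ones) receive larger softmax weights, which is the hard example mining effect noted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sampled_softmax_loss(model: nn.Module,
                         users: torch.Tensor,      # (B,)
                         pos_items: torch.Tensor,  # (B,)
                         neg_items: torch.Tensor,  # (B, K) sampled negatives per user
                         tau: float = 0.1) -> torch.Tensor:
    """Softmax loss over one positive and K sampled negatives per user."""
    pos = model(users, pos_items) / tau                                     # (B,)
    neg = model(users.unsqueeze(1).expand_as(neg_items), neg_items) / tau   # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                      # (B, 1+K)
    targets = torch.zeros(users.size(0), dtype=torch.long, device=logits.device)
    # Cross-entropy against column 0, i.e., -log softmax probability of the positive item.
    return F.cross_entropy(logits, targets)
```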