
cost. We draw our inspiration from recent work with great success in computer vision [13]. Its core idea is that representation learning and classification learning require different data distributions, whereas traditional models do not consider such decoupling for model parameters. Specifically, they propose a two-stage decoupled training strategy, where the first stage trains on the original long-tail distribution for item representation learning, and the second stage trains on the re-balanced data to improve the predictions of tail items. However, in recommendation applications, we empirically observe that these methods suffer from a severe forgetting issue [21]. That is, the knowledge learned for certain items (e.g., head items) in the first training stage is easily forgotten when the learning focus shifts to other items (e.g., tail items) in the second training stage, leading to a degradation in overall model quality (as shown in Figure 1). Moreover, in large-scale production systems, two-stage training is much more complex to build and maintain than a co-training scheme. In light of the pros and cons of this method, we aim to develop a model that improves the decoupling technique to accommodate web-scale recommenders.
Like many other methods tackling cold-start problems, the decoupling methods potentially hurt the overall recommendation performance. We attempt to understand this from a theoretical point of view. In particular, we find that the prediction of user preference towards an item is biased. The bias comes from the differences between training and serving data in two respects: 1) the item distributions, and 2) the user preference given an item. Most existing methods mainly attempt to reduce the bias from the item-distribution perspective, ignoring the discrepancy in user preference given an item.
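To make the two bias sources concrete, note that the joint distribution over (user, item) pairs factorizes as follows (notation ours, for illustration; the formal analysis follows in Section 2):

    𝑝_train(𝑢, 𝑖) = 𝑝_train(𝑖) · 𝑝_train(𝑢 | 𝑖),    𝑝_serve(𝑢, 𝑖) = 𝑝_serve(𝑖) · 𝑝_serve(𝑢 | 𝑖).

A model fit to the training distribution is thus unbiased at serving time only if both factors match: re-balancing or re-weighting items corrects a mismatch in 𝑝(𝑖), but leaves any mismatch in 𝑝(𝑢 | 𝑖) untouched.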
Motivated by the theoretical findings, we propose a novel Cross Decoupling Network (CDN) framework to mitigate the forgetting issue by considering decoupling from both the item and user sides. "Decoupling" means we treat the corresponding learning as two independent processes. In more detail:
• On the item side, the amount of user feedback received by head items and tail items varies significantly. This variation may cause the model to forget the learned knowledge of head items (more memorization) when its attention shifts to tail items (more generalization)¹. Hence, we propose to decouple memorization and generalization for item representation learning. In particular, we first group features into memorization-related and generalization-related features, and feed them separately into a memorization-focused expert and a generalization-focused expert, which are then aggregated through a frequency-based gating. This mixture-of-experts structure allows us to dynamically balance memorization and generalization abilities for head and tail items (illustrated in the sketch following this overview).
• On the user side, we leverage a regularized bilateral branch network to decouple user samples from two distributions. The network consists of two branches: a "main" branch that trains on the original distribution for high-quality representation learning, and a new "regularizer" branch that trains on the re-balanced distribution to add more tail information to the model. These two branches share some hidden layers and are jointly trained to mitigate the forgetting issue. A shared tower on the user side is used for scalability (see the sketch below).

¹Memorization is to learn and exploit the existing knowledge of visited training data. Generalization, on the other hand, is to explore new knowledge that has not occurred in the training dataset, based on transitivity (e.g., data correlation) [5].

Figure 1: The recommender performance (HR@50) of different methods on tail items (x-axis) and overall items (y-axis). Dots and plus signs represent four two-stage decoupling methods. 'H' means focusing on head items, 'T' means focusing on tail items, and → means switching from the 1st to the 2nd stage. When tail items (T) are the focus of the second stage, the performance on tail items improves; however, the overall performance significantly degrades. Our model CDN achieves excellent performance on both overall and tail items.
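To make the two decoupled components above concrete, the following is a minimal sketch (runnable PyTorch; all layer sizes, feature groupings, and the gate's input are our own assumptions, not the authors' implementation) of an item tower with a memorization expert, a generalization expert, and a frequency-based gate, together with a bilateral-branch user tower sharing lower layers:

    import torch
    import torch.nn as nn

    class ItemTower(nn.Module):
        """Sketch: decouples memorization and generalization via two experts."""
        def __init__(self, num_items, content_dim, dim=32):
            super().__init__()
            # Memorization-focused expert: consumes ID-like features (e.g. item id).
            self.mem_expert = nn.Embedding(num_items, dim)
            # Generalization-focused expert: consumes content features
            # (e.g. tags, categories) whose knowledge transfers across items.
            self.gen_expert = nn.Sequential(
                nn.Linear(content_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            # Frequency-based gate: item popularity decides how much weight each
            # expert receives (head -> memorization, tail -> generalization).
            self.gate = nn.Linear(1, 2)

        def forward(self, item_id, content, log_freq):
            mem = self.mem_expert(item_id)                  # [B, dim]
            gen = self.gen_expert(content)                  # [B, dim]
            w = torch.softmax(self.gate(log_freq), dim=-1)  # [B, 2]
            return w[:, :1] * mem + w[:, 1:] * gen          # convex expert mix

    class UserTower(nn.Module):
        """Sketch: bilateral branches over one shared tower, for scalability."""
        def __init__(self, user_dim, dim=32):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(user_dim, dim), nn.ReLU())
            self.main_head = nn.Linear(dim, dim)  # fed the original distribution
            self.reg_head = nn.Linear(dim, dim)   # fed the re-balanced distribution

        def forward(self, x_main, x_rebalanced):
            return (self.main_head(self.shared(x_main)),
                    self.reg_head(self.shared(x_rebalanced)))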
Finally, a new adapter (called the 𝛾-adapter) is introduced to aggregate the learned vectors from the user and item sides. By adjusting the hyperparameter 𝛾 in the adapter, we are able to shift the training attention to tail items in a soft and flexible way, based on different long-tail distributions.
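A minimal sketch of how such an adapter could combine the two user branches with the item vector follows; the specific schedule for the mixing weight alpha is our assumption for illustration (the exact form of the 𝛾-adapter is defined later in the paper), in the spirit of bilateral-branch networks where the weight decays over training:

    def gamma_adapter_logit(u_main, u_reg, item_vec, step, total_steps, gamma):
        # Hypothetical schedule: alpha starts near 1 (attention on the original,
        # head-dominated distribution) and decays toward the re-balanced,
        # tail-focused branch; gamma controls how soft/fast the shift is.
        alpha = max(0.0, 1.0 - step / (gamma * total_steps))
        s_main = (u_main * item_vec).sum(-1)  # main-branch dot-product score
        s_reg = (u_reg * item_vec).sum(-1)    # regularizer-branch score
        return alpha * s_main + (1.0 - alpha) * s_reg

Under this reading, a larger 𝛾 keeps the training attention on the original distribution for longer, while a smaller 𝛾 shifts it to tail items sooner, matching the soft, distribution-dependent control described above.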
The resulting model, CDN, mitigates the forgetting issue of existing models: it not only improves tail item performance, but also preserves or even improves the overall performance. Further, it adopts a co-training scheme that is easy to maintain. All of these make CDN suitable for industrial-caliber applications.
The contributions of this paper are four-fold:
• We provide a theoretical understanding of how the long-tail distribution influences recommendation performance from both the item and user perspectives.
• We propose a novel cross decoupling network that decouples the learning processes of memorization and generalization, as well as the sampling strategies. A 𝛾-adapter is utilized to aggregate the learning from the two sides.
• Extensive experimental results on public datasets show that CDN significantly outperforms SOTA methods, improving performance on both overall and tail item recommendations.
• We further provide a case study of applying CDN to a large-scale recommender system at Google. We show that CDN is easy to adapt to real-world settings, and achieves significant quality improvements both offline and online.
2 LONG-TAIL DISTRIBUTION IN
RECOMMENDATION AND MOTIVATION
Problem Settings. Our goal is to predict user engagement (e.g., clicks, installs) with candidate items (e.g., videos, apps) that follow long-tail distributions. We start by formulating the problem: Given a set of users U = {1, 2, . . . , 𝑚}, a set of items I = {1, 2, . . . , 𝑛}, and their content information (e.g., item tags, categories). Let 𝑑̂(𝑢, 𝑖)