
Understanding or Manipulation: Rethinking Online
Performance Gains of Modern Recommender Systems
ZHENGBANG ZHU, Shanghai Jiao Tong University, China
RONGJUN QIN, National Key Laboratory for Novel Software Technology, Nanjing University, China
and Polixir Technologies, China
JUNJIE HUANG, Shanghai Jiao Tong University, China
XINYI DAI, Shanghai Jiao Tong University, China
YANG YU, National Key Laboratory for Novel Software Technology, Nanjing University, China and Polixir
Technologies, China
YONG YU, Shanghai Jiao Tong University, China
WEINAN ZHANG, Shanghai Jiao Tong University, China
Recommender systems are expected to be assistants that help human users find relevant information
automatically without explicit queries. As recommender systems evolve, increasingly sophisticated learning techniques
are applied and have achieved better performance in terms of user engagement metrics such as clicks and
browsing time. The increase in the measured performance, however, can have two possible attributions: a
better understanding of user preferences, and a more proactive ability to utilize human bounded rationality to
seduce user over-consumption. A natural following question is whether current recommendation algorithms
are manipulating user preferences. If so, can we measure the manipulation level? In this paper, we present a
general framework for benchmarking the degree of manipulations of recommendation algorithms, in both
slate recommendation and sequential recommendation scenarios. The framework consists of four stages, initial
preference calculation, training data collection, algorithm training and interaction, and metrics calculation that
involves two proposed metrics, Manipulation Score and Preference Shift. We benchmark some representative
recommendation algorithms on both synthetic and real-world datasets under the proposed framework. We
observe that a high online click-through rate does not necessarily reflect a better understanding of
users’ initial preferences, but can instead stem from prompting users to choose more documents they initially did not favor.
Moreover, we find that the training data have a notable impact on the degree of manipulation, and algorithms
with more powerful modeling abilities are more sensitive to such impacts. The experiments also verify
the usefulness of the proposed metrics for measuring the degree of manipulation. We advocate that future
recommendation algorithm studies should be treated as an optimization problem with constrained user
preference manipulations.
CCS Concepts: • Information systems → Recommender systems.
Additional Key Words and Phrases: Recommender System, User Model, Bounded Rationality
Corresponding authors.
Authors’ addresses: Zhengbang Zhu, Shanghai Jiao Tong University, China, zhengbangzhu@sjtu.edu.cn; Rongjun Qin,
National Key Laboratory for Novel Software Technology, Nanjing University, China and Polixir Technologies, China,
qinrj@polixir.ai; Junjie Huang, Shanghai Jiao Tong University, China, legend0018@sjtu.edu.cn; Xinyi Dai, Shanghai Jiao
Tong University, China, daixinyi@sjtu.edu.cn; Yang Yu, National Key Laboratory for Novel Software Technology, Nanjing
University, China and Polixir Technologies, China, yuy@polixir.ai; Yong Yu, Shanghai Jiao Tong University, China,
yyu@apex.sjtu.edu.cn; Weinan Zhang, Shanghai Jiao Tong University, China, wnzhang@sjtu.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 1046-8188/2023/12-ART
https://doi.org/10.1145/3637869
ACM Trans. Inf. Syst., Vol. 1, No. 1, Article . Publication date: December 2023.
arXiv:2210.05662v2 [cs.IR] 18 Dec 2023
ACM Reference Format:
Zhengbang Zhu, Rongjun Qin, Junjie Huang, Xinyi Dai, Yang Yu, Yong Yu, and Weinan Zhang. 2023.
Understanding or Manipulation: Rethinking Online Performance Gains of Modern Recommender Systems.
ACM Trans. Inf. Syst. 1, 1 (December 2023), 33 pages. https://doi.org/10.1145/3637869
1 INTRODUCTION
With the popularity of the Internet and the growth of User Generated Content (UGC) [50] in
recent years, we are being overwhelmed by a massive volume of information. To save users
from information overload, recommender systems have been widely applied in today’s short
video [55], news [87] and e-commerce [17] platforms. Different from traditional information
retrieval techniques, recommender systems are considered to be more advanced because they use user
information to personalize the recommendations. This is accomplished through the prediction of the
relevance between users and documents. Various recommendation algorithms have been proposed
for relevance modeling, and their development is evidenced by the progressive improvement in
various offline and online metrics. Offline metrics refer to those that can be evaluated with a static
dataset and without interactions with real users. For example, Normalized Discounted Cumulative
Gain (NDCG) is commonly used in the recommendation literature to measure how well the model
ranks more relevant documents at the top of the list. On the other hand, online metrics have to
be computed by deploying the trained model online and gathering feedback from real users. As
a common example of online metrics, Click-Through Rate (CTR) shows how often people click on the
documents displayed by a recommendation algorithm.
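The two metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the paper: the function names are hypothetical, and the DCG below uses the linear-gain variant (relevance divided by log2 of the rank plus one), which is one of several common conventions.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items, taken in ranked order."""
    # enumerate() is 0-based, so the item at 1-based position p gets discount log2(p + 1).
    return sum(rel / math.log2(idx + 2) for idx, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def click_through_rate(clicks: Sequence[int]) -> float:
    """Fraction of displayed documents that were clicked (1 = click, 0 = no click)."""
    return sum(clicks) / len(clicks) if clicks else 0.0
```

NDCG needs only logged relevance labels, so it can be computed offline, whereas the click vector passed to `click_through_rate` only exists once a model has been shown to users (or to a simulator).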
While improved online metrics bring more traffic and revenue to the platform, it is questionable
whether existing online metrics are aligned with the aspiration of recommender systems, which is
to help people find relevant or favored content. As revealed by research in behavioral economics,
humans make decisions with bounded rationality [77]. Bounded rationality is contrary to
expected-utility models [75], where decisions made are always optimal under some expected utility function.
Instead, humans try to make optimal decisions, but are bounded by various cognitive
limitations, e.g., limited memory and decision time. Under the bounded rationality framework, a
variety of phenomena have been observed and investigated, such as the decoy effect, confirmation
bias, the anchoring effect, etc. [29, 62, 81]. Most of them can arise in online recommendations, as
already studied by prior works [1, 53, 85]. Once a recommender system is deployed online, it
will actively interact with human users of bounded rationality. Therefore, we cannot distinguish
whether the active participation of users is due to the system’s accurate grasp of their preferences
or to an over-exploitation of their psychological weaknesses. To our knowledge, there is no existing
work that studies whether the performance gain of current recommendation algorithms in online
metrics comes from the utilization of bounded rationality.
Since intrusive evaluations directly on the production platform can have a negative impact on the
company’s benefits, simulated interactive environments are built to assess the properties of
algorithms when interacting with users without actually going online. Agent-based user simulations
were adopted by early works in building dialogue systems [74], where user behaviors are simulated by
either predefined rules [52, 67] or statistical models trained on a small amount of data [51]. In recent
years, a series of simulation frameworks or specific environments have been proposed in the field
of recommender systems. Recsim [42] is a general framework for simulating the interactive process
of online recommendations, where the whole recommender system is decomposed into separate
configurable components. Shi et al. [80] propose Virtual-Taobao as an interactive e-commerce
environment, in which the user behaviors are given by a neural model adversarially trained on
real logged data. However, the evaluation criteria adopted in existing works are aligned with
Fig. 1. Examples of two typical scenarios in recommender systems. On the left is the search result in an
e-commerce platform, referred to as slate recommendation. On the right is the illustration of a recommendation
queue in music platforms, referred to as sequential recommendation.
traditional online metrics, and there is no simulation framework to specifically test the extent to
which recommendation algorithms manipulate user preferences.
To address these concerns, in this paper, we provide a general simulation framework, Mirror, to
evaluate the manipulation of user preferences by recommendation algorithms. The evaluation
consists of four stages. In the first stage, each simulated user is presented with all the documents
one by one to get a score for each document. Since the scoring happens before the interactions with
recommender systems, and each document itself makes up a slate, such scores are not influenced
by historically viewed documents or by other documents within the same recommended slate.
Therefore, the obtained scores can serve as the initial preferences of users. In the second stage, we
generate training datasets by using pre-defined initial recommendation strategies to interact with
users. To acquire a diverse set of training datasets and study how training datasets
influence trained models, we mix the datasets from multiple initial strategies with different ratios.
After that, in the third stage, the recommendation model is trained on those datasets and evaluated
by interacting with simulated users. At the final stage, the interaction data, as well as users’ initial
preferences, are jointly used to compute evaluation metrics. By incorporating the initial preferences,
we define the notion of favorite documents and propose several manipulation-aware metrics.
Those metrics can be used in both slate and sequential recommendations, which are two typical
recommendation scenarios as illustrated in Fig. 1. Intuitively, if the users click on more documents
but a smaller proportion of favorite documents, we say the recommendation algorithms are manipulating
the users’ preferences. Also, the manipulation can be quantified by the long-term shift in users’
unbiased preferences.
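The intuition behind the first and final stages can be sketched as follows. This is only an illustration under stated assumptions: the formal definitions of favorite documents, Manipulation Score and Preference Shift are not given in this part of the paper, so the sketch assumes that favorites are the top-k documents by initial score, and all names (`initial_preferences`, `favorite_documents`, `favorite_click_ratio`, `score_fn`) are hypothetical.

```python
from typing import Callable, Dict, Sequence, Set


def initial_preferences(user: str, documents: Sequence[str],
                        score_fn: Callable[[str, str], float]) -> Dict[str, float]:
    """Stage 1: show each document alone (a one-document slate) and record its score."""
    return {doc: score_fn(user, doc) for doc in documents}


def favorite_documents(prefs: Dict[str, float], top_k: int) -> Set[str]:
    """Assumed definition: the top_k documents by initial score."""
    ranked = sorted(prefs, key=lambda doc: prefs[doc], reverse=True)
    return set(ranked[:top_k])


def favorite_click_ratio(clicked: Sequence[str], favorites: Set[str]) -> float:
    """Share of clicked documents that belong to the user's initial favorites."""
    if not clicked:
        return 0.0
    return sum(doc in favorites for doc in clicked) / len(clicked)


# Toy data: a hypothetical simulated user with fixed scores for four documents.
toy_scores = {"a": 0.9, "b": 0.7, "c": 0.2, "d": 0.1}
prefs = initial_preferences("u1", list(toy_scores), lambda user, doc: toy_scores[doc])
favs = favorite_documents(prefs, top_k=2)

baseline_clicks = ["a", "b"]       # 2 clicks, both on favorites
suspect_clicks = ["a", "c", "d"]   # 3 clicks, only 1 on a favorite
```

In this toy case the second click log has more clicks but a lower favorite-click ratio, which is the pattern the text describes as manipulation.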
Under the proposed framework, we conduct extensive studies based on four different simulation
experiments, two of which are rule-based and the other two are data-driven. Those environments
cover both slate and sequential recommendations. In slate recommendations, we find that compared
to simple point-wise models, the reranking methods with the ability to model item relationships
within one slate tend to generate recommendations that have higher clicks but a lower proportion
of clicks on favorite documents. As the training data contains more slates with manipulations,
all the algorithms recommend slates with increasing clicks and decreasing proportion of favorite
documents clicked. Moreover, our case study reveals that the reranking method makes use of
comparison bias, where people tend to choose those documents if they have already seen similar
but worse alternatives within the same list. In sequential recommendations, the experimental
results show that sequential recommendation algorithms that take users’ recent interaction history
into account can lead to more significant manipulation of users’ preferences.
To summarize, the main contributions of our work are:
• We put forward the potential issue of manipulation of user preferences in current
recommendation systems, and propose a general and configurable framework named Mirror
that can quantify the degree of such manipulation by recommendation algorithms. The
core procedure inside the benchmark is to make simulated users view and score all
documents one by one, through which we can access the initial preferences of each user.
• By utilizing the proposed framework, we instantiate four benchmark scenarios, covering both
slate recommendations and sequential recommendations, and including both synthetic and
data-driven user behavior models.
• We benchmark several recommendation algorithms under these scenarios, and draw
a number of key findings on manipulation by recommendation systems:
– In both slate and sequential recommendations, better performance on traditional online
metrics is accompanied by more manipulation of user preferences.
– In both slate and sequential recommendations, the degree of manipulation by
recommendation algorithms is overall positively correlated with the manipulation degree of the
recommendation algorithm that collects the training data.
– In slate recommendations, reranking methods use the mutual influence
of documents within the slate more actively than point-wise ranking
methods to improve the overall clicks, and such influence often leads users to click on
documents they do not originally favor.
– In sequential recommendations, recommendation models that take the user’s recent
interaction behavior into account can make users choose a smaller proportion of originally favored
documents and induce greater changes to user preferences.
The remainder of this paper is organized as follows. Section 2 provides an overview of
current recommender systems and the emerging concerns on negative impacts. In Section 3, we
make two necessary assumptions to delimit the scope of manipulations studied in this paper. In
Section 4, we introduce our benchmark framework Mirror, including its components, benchmark
procedure and evaluation metrics. In Section 5, we conduct four benchmark experiments and
analyze the results. Related works are summarized in Section 6. We nally conclude this paper in
Section 7.
2 AN OVERVIEW OF CURRENT RECOMMENDER SYSTEMS
We are living in an era of information explosion. With the popularity of the internet, we have
access to far more information than ever before, making it difficult to find what we need among
the vast amount of content. To save people from information overload and find the most relevant
content for each of us, modern recommender systems make use of rich user profiles to model user
preferences. These systems utilize techniques like collaborative filtering to guess what users might
favor, using not only their own but also other users’ behaviors. The more you use the recommender
system, and the more other people use it, the more relevant the results recommended
to you become [49, 73].
On the platform side, the situation is also thriving. For companies that deploy large-scale
recommender systems on their products or platforms, personalized recommendation greatly
improves the retention rate of users, as well as the clicks or purchases, depending on the service
they provide. A pioneering study [25] from Google reports an average increase
of 38% in click rates on its news platform with personalized recommendation, compared to
general methods. Even more strikingly, the e-commerce giant eBay reported a 500% spike in
Gross Merchandise Volume in online A/B testing [18] with its personalized recommendation.
In industry, recommender systems continue to iterate, taking advantage of the increasing
amount of data and computing power available to deliver significant revenue for businesses.
Fig. 2. The overview of the workflow of a typical recommender system. Here we show user feedback as clicks,
but in practice, user feedback can also be purchases, ratings, or just browsing, etc.
Let us briefly review the working mechanism of a standard recommender system. As illustrated
in Fig. 2, we consider the multi-stage recommendation setting with user clicks as feedback. The
users can interact with the recommender system $\mathcal{R}$ for one or multiple rounds. Each round starts
with a user $u$ proposing a query $q$ to $\mathcal{R}$. During round $r$, the user is recommended a sorted list
of $T$ documents $D^r = \{d^r_1, d^r_2, \ldots, d^r_T\}$, and the user can choose to click on any number of documents,
forming a click sequence $C^r = \{c^r_1, c^r_2, \ldots, c^r_T\}$. The process continues until the user exits.
The recommended list $D^r$ is generated by the recommender system from a three-stage procedure.
A small candidate set $\mathcal{D}_q$ is first retrieved based on $q$ by a recall algorithm from the full document
set. Then the documents in the candidate set are scored and sorted by a ranking algorithm. The top-$T$
ranked items form the initial ranking list and are subsequently processed by a reranking algorithm
to get a refined order. The reordered list is used as the recommended list $D^r$ and displayed to the user.
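The three-stage procedure (recall, ranking, reranking) can be sketched as a small function. This is a hedged illustration rather than any production pipeline: the `recommend` function, its parameter names, and the toy recall, scoring and reranking rules are all assumptions introduced here.

```python
from typing import Callable, List, Sequence


def recommend(query: str,
              all_documents: Sequence[str],
              recall: Callable[[str, Sequence[str]], List[str]],
              score: Callable[[str, str], float],
              rerank: Callable[[List[str]], List[str]],
              T: int) -> List[str]:
    """Return the recommended list for one round of interaction."""
    # Stage 1 (recall): retrieve a small candidate set from the full document set.
    candidates = recall(query, all_documents)
    # Stage 2 (ranking): score each candidate independently and sort by score.
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    initial_list = ranked[:T]
    # Stage 3 (reranking): refine the order of the top-T list as a whole.
    return rerank(initial_list)


# Toy components (hypothetical): substring recall, shorter-title-first scoring,
# and a reranker that simply reverses the list.
docs = ["red shoes", "blue shoes", "red hat", "green shoes", "shoe polish"]
result = recommend(
    query="shoe",
    all_documents=docs,
    recall=lambda q, ds: [d for d in ds if q in d],
    score=lambda q, d: -len(d),
    rerank=lambda lst: list(reversed(lst)),
    T=2,
)
```

Note that the ranking stage scores each document in isolation, while the reranking stage receives the whole list, which is what allows a reranker to take the relationships between documents in the same slate into account.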