Take a Fresh Look at Recommender Systems from an Evaluation Standpoint
Aixin Sun
School of Computer Science and Engineering
Nanyang Technological University
Singapore
axsun@ntu.edu.sg
ABSTRACT
Recommendation has become a prominent area of research in the field of Information Retrieval (IR). Evaluation is also a traditional research topic in this community. Motivated by a few counter-intuitive observations reported in recent studies, this perspectives paper takes a fresh look at recommender systems from an evaluation standpoint. Rather than examining metrics like recall, hit rate, or NDCG, or perspectives like novelty and diversity, the key focus here is on how these metrics are calculated when evaluating a recommender algorithm. Specifically, the commonly used train/test data splits and their consequences are re-examined. We begin by examining common data splitting methods, such as random split or leave-one-out, and discuss why the popularity baseline is poorly defined under such splits. We then move on to explore the two implications of neglecting a global timeline during evaluation: data leakage and oversimplification of user preference modeling. Afterwards, we present new perspectives on recommender systems, including techniques for evaluating algorithm performance that more accurately reflect real-world scenarios, and possible approaches to consider decision contexts in user preference modeling.
CCS CONCEPTS
• Information systems → Recommender systems; Collaborative filtering.

KEYWORDS
Recommendation, global timeline, practical evaluation, user preference modeling
ACM Reference Format:
Aixin Sun. 2023. Take a Fresh Look at Recommender Systems from an
Evaluation Standpoint. In Proceedings of the 46th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR
’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3539618.3591931
This paper is published under the Creative Commons Attribution 4.0 International
(CC-BY 4.0) license.
SIGIR ’23, July 23–27, 2023, Taipei, Taiwan
© 2023 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9408-6/23/07.
https://doi.org/10.1145/3539618.3591931
1 INTRODUCTION
Out of all the papers published in SIGIR 2022, 27.5% have titles that include the words "recommender" or "recommendation".¹ This is a strong indication of research interest in Recommender Systems (RecSys) in the Information Retrieval (IR) community. As evaluation is also a traditional research topic in IR, it is interesting to study how recommendation algorithms are evaluated in general. More interestingly, a few recent papers report counter-intuitive observations made from experiments on recommender systems, both in offline and online settings [18, 26, 37, 38, 40].
Here are some example counter-intuitive observations. Ji et al. [18] report that both users who spend more time and users who have many interactions with a recommender system receive poorer recommendations, compared to users who spend less time or who have relatively fewer interactions with the system. This observation holds for the recommendation results of multiple models (i.e., BPR [33], Neural MF [14], LightGCN [13], SASRec [20], and TiSASRec [25]) on multiple datasets, including MovieLens-25M, Yelp, Amazon-music, and Amazon-electronic. Through online experiments with a large Internet footwear vendor, Sysko-Romańczuk et al. [37] observe that "experience with the vendor showed a negative correlation with recommendation performance". The factors considered under "experience" include the number of days since account creation, the number of days since the first shopping transaction, and the number and value of purchase transactions made in the past year. Another study reports that "using only the more recent parts of a dataset can drastically improve the performance of a recommendation system" [40].
We interpret the reported counter-intuitive observations from two perspectives. First, these observations are made with respect to the time dimension, more specifically, the global timeline of user-item interactions. Here, we are not considering time as an additional feature or context in algorithm modeling. Rather, we consider the arrangement of the user-item interactions by their timestamps in chronological order during evaluation.² Hence, we have "number of days since the first transaction" and "recent parts of a dataset". The reported counter-intuitive observations call for a revisit of the importance of observing the global timeline in evaluating recommender models. Findings from such a revisit may impact the way we conduct evaluation, which in turn affects model design and, more importantly, our understanding of recommender systems. Second,
these observations are considered counter-intuitive because they contradict our expectation (or an implicit assumption) of a recommender: the more interactions a user has with a system, the higher the chance that the recommender learns the user's preferences well. However, these observations show otherwise.

¹ https://dblp.org/db/conf/sigir/sigir2022.html
² Although there are recommendation models which consider time as a contextual feature in their modeling, not many studies arrange user-item interactions along the global timeline chronologically and consider the absolute time points of the interactions in their evaluations. We will use a case study to support this claim shortly.

Table 1: The five offline evaluation settings described in [12], from the ideal and closest simulation of the online process (Setting 1) to the least faithful simulation (Setting 5). The last column indicates whether the data split observes the global timeline of user actions.

Setting 1 (Global timeline: Yes). Step through user actions in temporal order and make predictions for each user action along the way, based on the known user actions at the prediction time. Before the testing time point, every user action serves as a test instance, and subsequently becomes a training instance.

Setting 2 (Global timeline: Yes). Following Setting 1, instead of evaluating all user actions along time, only evaluate sampled user actions as test instances. The only difference to Setting 1 is the reduced number of test instances along the way.

Setting 3 (Global timeline: Partially). Sample a set of test users, then sample a single test time, and hide all items of the test users after that time point. That is, the data is partitioned into train/test sets based on a single time point.

Setting 4 (Global timeline: No). Sample a test time for each test user (e.g., right before the user's last action), and do not observe the global timeline across all users. Leave-one-out is an example data split scheme under this setting.

Setting 5 (Global timeline: No). Completely ignore time, as in the case where timestamps of user actions are unknown. Data is randomly partitioned into train and test sets.
With the global timeline in mind, we conduct a case study to find out to what extent the global timeline is observed in offline evaluation in academic papers (Section 2). Our case study is based on the full and industry papers published in the ACM Recommender Systems conference in the past three years (2020, 2021, 2022). Based on the findings, we revisit Popularity, the simplest recommendation model, in Section 3 to justify why this commonly used baseline is ill-defined. Then we move on to discuss the consequences of ignoring the global timeline in evaluation: data leakage (Section 4) and simplification of user preference modeling (Section 5). In Section 6, we propose a fresh look at recommender systems from the evaluation perspective. In Section 7, we summarize the key messages and contributions of this work, after which the paper is concluded in Section 8.
2 CASE STUDY: DATA SPLIT SCHEMES
Most academic researchers do not have access to an online platform to directly evaluate their models with real user-item interactions. Evaluation on an offline dataset is the only choice in most cases. It is also well known that many more factors may affect user behaviour online, and the predictive power observed in offline evaluations may or may not carry over to the online setting. Hence, "the goal of the offline experiments is to filter out inappropriate approaches, leaving a relatively small set of candidate algorithms to be tested" online, as stated in the evaluation chapter of the recommender systems handbook [12, 34]. However, to conduct offline evaluation, "it is necessary to simulate the online process where the system makes predictions or recommendations" [12]. Clearly, a closer simulation of the online process makes the results obtained from offline evaluation more indicative, better serving the purpose of algorithm selection.
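
To make this notion of simulation concrete, the following Python sketch steps through a toy interaction log in temporal order, in the spirit of Setting 1 in Table 1: it asks a model for recommendations before each interaction is revealed, and only then adds that interaction to the training data. This is a minimal illustration of the idea quoted above, not the protocol of any particular paper; the toy log, the popularity-based recommend function, and the hit@2 measure are all illustrative assumptions.

from collections import defaultdict

# Toy interaction log of (timestamp, user, item) tuples; purely illustrative.
log = [
    (1, "u1", "a"), (2, "u2", "a"), (3, "u1", "b"),
    (4, "u3", "a"), (5, "u2", "c"), (6, "u1", "c"),
]

def recommend(history, user, k=2):
    # Placeholder model: rank items by their popularity in the history so far,
    # excluding items the user has already interacted with.
    counts = defaultdict(int)
    for _, u, i in history:
        counts[i] += 1
    seen = {i for _, u, i in history if u == user}
    ranked = sorted(counts, key=counts.get, reverse=True)
    return [i for i in ranked if i not in seen][:k]

# Setting-1-style simulation: step through user actions in temporal order.
# Each action is first used as a test instance, then added to the training data.
hits, tests = 0, 0
history = []
for t, user, item in sorted(log):
    if history:                      # need some history before predicting
        tests += 1
        hits += item in recommend(history, user)
    history.append((t, user, item))  # the tested action now becomes training data

print(f"hit@2 = {hits / tests:.2f}")

Under such a scheme, every prediction is made using only interactions that precede it on the global timeline; the settings summarized next progressively relax this requirement.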
Table 1 summarizes the five settings described in Gunawardana et al. [12], from the ideal setting (Setting 1), which simulates the online process as closely as possible, to the most simplified setting (Setting 5). For simplicity, in our discussion we only consider training and test instances, and do not consider a validation or development set. We remark that the last two settings (Settings 4 and 5) do not maintain or observe the global timeline across all users. Hence, these two settings are not close simulations of the online recommendation process. As for Setting 3, the partition of train/test sets is based on a single time point along the global timeline. However, within the train or test sets, the data instances may not maintain their temporal order.

Table 2: Number and percentage of papers by their adopted data split scheme in the ACM RecSys conference (2020 - 2022).

30 papers (34.1%)   Random split              Global timeline: No
22 papers (25.0%)   Leave-one-out             Global timeline: No
17 papers (19.3%)   Single time point         Global timeline: Partially
15 papers (17.0%)   Simulation-based online   Global timeline: Yes
 4 papers  (4.5%)   Sliding window            Global timeline: Yes
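
For readers who prefer code to prose, the sketch below contrasts three of the schemes above on a small toy log: a single-time-point split (Setting 3), leave-one-out (Setting 4), and a random split (Setting 5). For each split it reports the fraction of test interactions that are older than the newest training interaction, i.e., predictions for which "future" data sits in the training set. The helper functions and the toy data are illustrative assumptions, not taken from [12] or from any of the surveyed papers.

import random

# Toy interaction log of (timestamp, user, item) tuples; purely illustrative.
log = [
    (1, "u1", "a"), (2, "u2", "a"), (3, "u2", "b"),
    (4, "u1", "b"), (5, "u1", "c"), (6, "u3", "a"), (7, "u3", "c"),
]

def single_time_point_split(log, t_split):
    # Setting 3: everything before t_split is training data, the rest is test data.
    train = [r for r in log if r[0] < t_split]
    test = [r for r in log if r[0] >= t_split]
    return train, test

def leave_one_out_split(log):
    # Setting 4: each user's last interaction becomes a test instance,
    # regardless of when it happened relative to other users' interactions.
    last = {}
    for r in sorted(log):
        last[r[1]] = r               # keep the latest record per user
    test = list(last.values())
    train = [r for r in log if r not in test]
    return train, test

def random_split(log, test_ratio=0.3, seed=0):
    # Setting 5: ignore time entirely and partition the data at random.
    rng = random.Random(seed)
    shuffled = log[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def future_leakage(train, test):
    # Fraction of test interactions that are older than the newest training
    # interaction, i.e., cases where the model is trained on "future" data.
    newest_train = max(t for t, _, _ in train)
    return sum(t < newest_train for t, _, _ in test) / len(test)

splits = {
    "single time point": single_time_point_split(log, t_split=6),
    "leave-one-out": leave_one_out_split(log),
    "random split": random_split(log),
}
for name, (train, test) in splits.items():
    print(f"{name:18s} future leakage: {future_leakage(train, test):.2f}")

On such a log, the single-time-point split keeps every training interaction before every test interaction by construction, while leave-one-out mixes past and future, and a random split typically does the same; this mixing is the root of the data leakage problem discussed in Section 4.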
To understand which settings are more widely used in evaluating recommender systems, we conducted a case study to collect the data split schemes used in the papers published in the last three years (2020 - 2022) of the ACM Recommender Systems conference. The ACM RecSys conference is considered here for its strong relevance to the topic and its reasonable size. We considered all full papers and industry papers. However, a good number of papers study recommenders from a systems perspective, such as training efficiency and distributed and/or federated RecSys. Some others focus on user studies and user preference analysis. Hence, we did not include these papers in the case study. After filtering, we had 82 full and 9 industry papers with clear descriptions of their experiment settings. Among them, we further excluded another 3 papers: two design experiments dedicated to the cold-start setting, and one is on news recommendation, where the data is split by news topic. Finally, our case study included 88 papers.