
Table 1: The 5 offline evaluation settings described in [12], from the ideal and closest simulation of the online process (Setting 1) to the least (Setting 5). The last column indicates whether the data split observes the global timeline in user actions.
Setting | Train/test data split scheme | Global timeline
1 | Step through user actions in temporal order and make predictions for each user action along the way, based on the known user actions at the prediction time. Before the testing time point, every user action serves as a test instance, and subsequently becomes a training instance. | Yes
2 | Following Setting 1, instead of evaluating all user actions along time, only evaluate sampled user actions as test instances. The only difference from Setting 1 is the reduced number of test instances along the way. | Yes
3 | Sample a set of test users, then sample a single test time, and hide all items of test users after that time point. That is, the data is partitioned into train/test sets based on a single time point. | Partially
4 | Sample a test time for each test user (e.g., right before the user’s last action), and do not observe the global timeline across all users. Leave-one-out is an example data split scheme under this setting. | No
5 | Completely ignore time, as in the case where timestamps of user actions are unknown. Data is randomly partitioned into train and test sets. | No
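To make Setting 1 concrete, below is a minimal sketch of such a chronological replay, assuming interactions are stored as (user, item, timestamp) tuples and a hypothetical model object exposing fit() and recommend(); this is our own illustration, not an implementation from [12].

```python
# Minimal sketch of Setting 1 (chronological replay); hypothetical model API.
def replay_evaluation(interactions, model, k=10):
    """Step through user actions in temporal order. Each action is first
    evaluated as a test instance, then added to the training history."""
    history = []                 # all actions known at prediction time
    hits, tests = 0, 0
    for user, item, ts in sorted(interactions, key=lambda x: x[2]):
        if history:              # predict only once some history exists
            model.fit(history)                 # (re)train on actions seen so far
            ranked = model.recommend(user, k)  # top-k items for this user
            hits += int(item in ranked)        # hit@k on this single action
            tests += 1
        history.append((user, item, ts))       # the test instance becomes training data
    return hits / max(tests, 1)
```

Setting 2 follows the same loop but evaluates only a sampled subset of the actions, which reduces the often prohibitive cost of re-training at every step.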
contradict our expectation (or an implicit assumption) of a recommender. That is, the more interactions a user has with a system, the better the recommender should learn the user’s preference. However, these observations show otherwise.
With the global timeline in mind, we conduct a case study to find out to what extent the global timeline is observed in offline evaluation in academic papers (Section 2). Our case study is based on the full and industry papers published in the ACM Recommender Systems conference in the past three years (2020, 2021, 2022). Based on the findings, we revisit Popularity, the simplest recommendation model, in Section 3 to justify why this commonly used baseline is ill-defined. Then we move on to discuss the consequences of ignoring the global timeline in evaluation: data leakage (Section 4) and simplification of user preference modeling (Section 5). In Section 6, we propose a fresh look at recommender systems from the evaluation perspective. In Section 7, we present a summary of the key messages and contributions of this work, after which the paper is concluded in Section 8.
2 CASE STUDY: DATA SPLIT SCHEMES
Most academic researchers do not have access to an online platform to directly evaluate their models with real user-item interactions. Evaluation on an offline dataset is the only choice in most cases. It is also well known that many more factors may affect user behaviour online, and the predictive power observed in offline evaluation may or may not carry over to the online setting. Hence, “the goal of the offline experiments is to filter out inappropriate approaches, leaving a relatively small set of candidate algorithms to be tested” online, as stated in the evaluation chapter of the recommender systems handbook [12, 34]. However, to conduct offline evaluation, “it is necessary to simulate the online process where the system makes predictions or recommendations” [12]. Clearly, a closer simulation of the online process makes the results obtained from offline evaluation more indicative, and better serves the purpose of algorithm selection.
Table 1 summarizes the five settings described in Gunawardana et al. [12], from the ideal setting (Setting 1) of simulating the online process as closely as possible, to the most simplified setting (Setting 5).
Table 2: Number and percentage of papers by their adopted data split scheme at the ACM RecSys conference (2020–2022).

No. and % of papers | Data split scheme | Global timeline
30 (34.1%) | Random split | No
22 (25.0%) | Leave-one-out | No
17 (19.5%) | Single time point | Partially
15 (17.0%) | Simulation-based online | Yes
4 (4.5%) | Sliding window | Yes
For simplicity, in our discussion we consider only training and test instances, and do not consider a validation or development set. We remark that the last two settings (Settings 4 and 5) do not maintain or observe the global timeline across all users. Hence, these two settings are not considered close simulations of the online recommendation process. As for Setting 3, the partition of train/test sets is based on a single time point along the global timeline. However, within the train or test sets, the data instances may not maintain their temporal order.
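To illustrate how these settings translate into data splits, below is a minimal, hypothetical sketch of Settings 3, 4, and 5, again assuming interactions are (user, item, timestamp) tuples; the function names are our own and not taken from [12] or any specific library.

```python
import random

def single_time_point_split(data, t_split):
    """Setting 3: one global cut-off time; all actions after t_split are test."""
    train = [r for r in data if r[2] <= t_split]
    test = [r for r in data if r[2] > t_split]
    return train, test

def leave_one_out_split(data):
    """Setting 4: hold out each user's last action; the cut-off time differs
    per user, so the global timeline is not preserved."""
    by_user = {}
    for r in sorted(data, key=lambda x: x[2]):
        by_user.setdefault(r[0], []).append(r)
    train, test = [], []
    for actions in by_user.values():
        train.extend(actions[:-1])   # all but the last action go to training
        test.append(actions[-1])     # the last action is the test instance
    return train, test

def random_split(data, test_ratio=0.2, seed=0):
    """Setting 5: timestamps are ignored entirely."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Under the latter two schemes, a user’s test action may precede other users’ training actions, which is precisely the data leakage discussed in Section 4.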
To understand which settings are more widely used in evaluating recommender systems, we conducted a case study to collect the data split schemes used in the papers published in the last three years (2020–2022) of the ACM Recommender Systems conference. The ACM RecSys conference is considered here for its strong relevance to the topic and its reasonable size. We considered all full papers and industry papers. However, a good number of papers study recommenders from a systems perspective, e.g., training efficiency, or distributed and/or federated RecSys; some others focus on user studies and user preference analysis. Hence, we did not include these papers in the case study. After filtering, we had 82 full and 9 industry papers with clear descriptions of their experiment settings. Among them, we further excluded another 3 papers: two design experiments dedicated to the cold-start setting, and one studies news recommendation with the data split by news topic in its experiments.
Finally, our case study included 88 papers.