
Table 1: The 5 offline evaluation settings described in [12], from the ideal and closest simulation of the online process (Setting 1) to the least (Setting 5). The last column indicates whether the data split observes the global timeline in user actions.
Setting | Train/test data split scheme | Global timeline
1 | Step through user actions in temporal order and make predictions for each user action along the way, based on the known user actions at the prediction time. Before the testing time point, every user action serves as a test instance, and subsequently becomes a training instance. | Yes
2 | Following Setting 1, instead of evaluating all user actions along time, only evaluate sampled user actions as test instances. The only difference from Setting 1 is the reduced number of test instances along the way. | Yes
3 | Sample a set of test users, then sample a single test time, and hide all items of test users after that time point. That is, the data is partitioned into train/test sets based on a single time point. | Partially
4 | Sample a test time for each test user (e.g., right before the user’s last action), and do not observe the global timeline across all users. Leave-one-out is an example data split scheme under this setting. | No
5 | Completely ignore time, as in the case where timestamps of user actions are unknown. Data is randomly partitioned into train and test sets. | No
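To make Setting 1 concrete, below is a minimal sketch of such a chronological replay, assuming interactions are stored as (user, item, timestamp) tuples and a hypothetical model object exposing fit() and recommend(); this is our own illustration, not an implementation from [12].

```python
# Minimal sketch of Setting 1 (chronological replay); hypothetical model API.
def replay_evaluation(interactions, model, k=10):
    """Step through user actions in temporal order. Each action is first
    evaluated as a test instance, then added to the training history."""
    history = []                 # all actions known at prediction time
    hits, tests = 0, 0
    for user, item, ts in sorted(interactions, key=lambda x: x[2]):
        if history:              # predict only once some history exists
            model.fit(history)                 # (re)train on actions seen so far
            ranked = model.recommend(user, k)  # top-k items for this user
            hits += int(item in ranked)        # hit@k on this single action
            tests += 1
        history.append((user, item, ts))       # the test instance becomes training data
    return hits / max(tests, 1)
```

Setting 2 follows the same loop but evaluates only a sampled subset of the actions, which reduces the often prohibitive cost of re-training at every step.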
contradict our expectation (or an implicit assumption) of a recommender. That is, the more interactions a user has with a system, the better the recommender should learn the user’s preference. However, these observations show otherwise.
With the global timeline in mind, we conduct a case study to find out to what extent the global timeline is observed in offline evaluation in academic papers (Section 2). Our case study is based on the full and industry papers published in the ACM Recommender Systems conference in the past three years (2020, 2021, 2022). Based on the findings, we revisit Popularity, the simplest recommendation model, in Section 3 to justify why this commonly used baseline is ill-defined. Then we move on to discuss the consequences of ignoring the global timeline in evaluation: data leakage (Section 4) and simplification of user preference modeling (Section 5). In Section 6, we propose a fresh look at recommender systems from the evaluation perspective. In Section 7, we present a summary of the key messages and contributions of this work, after which the paper is concluded in Section 8.
2 CASE STUDY: DATA SPLIT SCHEMES
Most academic researchers do not have access to an online platform to directly evaluate their models with real user-item interactions. Evaluation on an offline dataset is the only choice in most cases. It is also well known that many more factors may affect user behaviour online, and the predictive power observed in offline evaluation may or may not carry over to the online setting. Hence, “the goal of the offline experiments is to filter out inappropriate approaches, leaving a relatively small set of candidate algorithms to be tested” online, as stated in the evaluation chapter of the recommender systems handbook [12, 34]. However, to conduct offline evaluation, “it is necessary to simulate the online process where the system makes predictions or recommendations” [12]. Clearly, a closer simulation of the online process makes the results obtained from offline evaluation more indicative, and better serves the purpose of algorithm selection.
Table 1 summarizes the five settings described in Gunawardana et al. [12], from the ideal setting (Setting 1) of simulating the online process as closely as possible, to the most simplified setting (Setting 5).
Table 2: Number and percentage of papers by their adopted data split scheme at the ACM RecSys conference (2020–2022).

No. and % of papers | Data split scheme | Global timeline
30 (34.1%) | Random split | No
22 (25.0%) | Leave-one-out | No
17 (19.5%) | Single time point | Partially
15 (17.0%) | Simulation-based online | Yes
4 (4.5%) | Sliding window | Yes
For simplicity, in our discussion we consider only training and test instances, and do not consider a validation or development set. We remark that the last two settings (Settings 4 and 5) do not maintain or observe the global timeline across all users. Hence, these two settings are not considered close simulations of the online recommendation process. As for Setting 3, the partition of train/test sets is based on a single time point along the global timeline. However, within the train or test sets, the data instances may not maintain their temporal order.
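To illustrate how these settings translate into data splits, below is a minimal, hypothetical sketch of Settings 3, 4, and 5, again assuming interactions are (user, item, timestamp) tuples; the function names are our own and not taken from [12] or any specific library.

```python
import random

def single_time_point_split(data, t_split):
    """Setting 3: one global cut-off time; all actions after t_split are test."""
    train = [r for r in data if r[2] <= t_split]
    test = [r for r in data if r[2] > t_split]
    return train, test

def leave_one_out_split(data):
    """Setting 4: hold out each user's last action; the cut-off time differs
    per user, so the global timeline is not preserved."""
    by_user = {}
    for r in sorted(data, key=lambda x: x[2]):
        by_user.setdefault(r[0], []).append(r)
    train, test = [], []
    for actions in by_user.values():
        train.extend(actions[:-1])   # all but the last action go to training
        test.append(actions[-1])     # the last action is the test instance
    return train, test

def random_split(data, test_ratio=0.2, seed=0):
    """Setting 5: timestamps are ignored entirely."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Under the latter two schemes, a user’s test action may precede other users’ training actions, which is precisely the data leakage discussed in Section 4.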
To understand which settings are more widely used in evaluating recommender systems, we conducted a case study to collect the data split schemes used in the papers published in the last three years (2020–2022) of the ACM Recommender Systems conference. The ACM RecSys conference is considered here for its strong relevance to the topic and its reasonable size. We considered all full papers and industry papers. However, a good number of papers study recommenders from a systems perspective, e.g., training efficiency, or distributed and/or federated RecSys; some others focus on user studies and user preference analysis. Hence, we did not include these papers in the case study. After filtering, we had 82 full and 9 industry papers with clear descriptions of their experiment settings. Among them, we further excluded another 3 papers: two design experiments dedicated to the cold-start setting, and one studies news recommendation with the data split by news topic in its experiments.
Finally, our case study included 88 papers.