
LiLAS integrated STELLA into two academic search systems: LIVIVO (for the task of ranking
documents with respect to a head query) and GESIS Search (for the task of ranking datasets with
respect to a reference document). We evaluated nine experimental systems contributed by three
participating groups. Overall, we consider our lab a successful advancement over previous living
lab experiments. We were able to exemplify the benefits of fully dockerized systems delivering
results for arbitrary requests on-the-fly. Furthermore, we could confirm several previous
findings, for instance the power laws underlying the click distributions.
2. Introduction
Involving users in the early phases of software development – in the form of user experience
analysis and prototype testing – has become common whenever some degree of user interaction
is required. This allows developers to consider the users’ needs from the beginning, making it
easier for users to adopt a new system or adapt to a new version. However, once a system is
put in place, new opportunities to observe, evaluate and learn from users become possible as
more information becomes available. For instance, it becomes possible to observe and record
interaction patterns and answer questions such as (a) how long it takes a user to find a particular
button, (b) how the user reacts to different options, or (c) what the most common paths to a goal
are, whether that goal is buying clothes or finding an article or dataset relevant for research.
All this information can be tracked, stored, analyzed, and evaluated, making systems more
attractive and easier to use. Gathering information from users in order to continuously evaluate
their interactions and learn more about their needs is common practice for commercial software
providers, as knowing their users and their behavior allows them to predict their actions, offer
better products that users are likely to buy and, therefore, make more profit. Despite the
benefits of user-based evaluation, this approach is not yet fully exploited by Information
Retrieval systems in the academic world.
Information Retrieval (IR) systems are commonly used in academia as they aim at presenting
the most relevant resources from a corpus for an information need. Typical tasks include ranking
a series of documents with respect to a query or offering recommendations with respect to a
document already selected as relevant (systems taking care of the latter are called recommendation
systems). Traditionally, IR and recommendation systems are used in academia to retrieve
scholarly articles from specialized repositories designed for this purpose. Although IR is an active
research area, evaluation remains a challenge in the academic context. One reason for this is
that IR evaluation mainly relies on the Cranfield paradigm. In this paradigm, search systems
are compared by processing a set of queries or topics based on a standard corpus of documents
while trying to produce the best possible results. Results are then evaluated with the help of
relevance assessments produced by domain experts.
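To make this offline setup concrete, the following minimal sketch (in Python) computes a standard Cranfield-style measure, nDCG, from a run and a set of relevance assessments; the topics, documents, and relevance grades are purely hypothetical and are not part of the lab's data or infrastructure.

    from math import log2

    # Relevance assessments (qrels): topic id -> {document id: graded relevance}
    # (hypothetical toy data for illustration only)
    qrels = {
        "q1": {"d1": 2, "d3": 1},
        "q2": {"d2": 1},
    }

    # A system's ranked results (run): topic id -> ordered list of document ids
    run = {
        "q1": ["d3", "d1", "d5"],
        "q2": ["d4", "d2", "d1"],
    }

    def ndcg_at_k(ranking, rels, k=10):
        # nDCG@k for one topic: discounted gain of the run over the ideal ranking
        gains = [rels.get(doc, 0) for doc in ranking[:k]]
        dcg = sum(g / log2(i + 2) for i, g in enumerate(gains))
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
        return dcg / idcg if idcg > 0 else 0.0

    # Offline campaigns typically report the mean over all topics
    scores = [ndcg_at_k(run[t], qrels[t]) for t in qrels]
    print(f"mean nDCG@10 = {sum(scores) / len(scores):.3f}")

In a shared task, the run would come from a participating system and the qrels from pooled expert assessments; the resulting score allows systems to be compared without involving real users.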
This research method has been established and proven for more than 25 years in international
evaluation campaigns such as the Text Retrieval Conference (TREC)¹ or the Conference and Labs
of the Evaluation Forum (CLEF)². However, this so-called offline evaluation or shared task
principle faces the criticism of drifting away
¹ https://trec.nist.gov/
² https://www.clef-initiative.eu/