
biased evaluation and subjective outcomes (Nießl et al., 2022; Waseem et al., 2021). The top-ranked systems may dominate the others only on the outlier tasks (Agarwal et al., 2021), and their ranking is inconsistent with the rankings produced by other Pythagorean means (Shavrina and Malykh, 2021). At the same time, mean aggregation ignores the relative ordering and relies on absolute score differences (Peyrard et al., 2017), treating tasks of different complexity (Mishra and Arunkumar, 2021) and from different domains (Webb, 2000) equally.
Novel aggregation principles.
Recent research has addressed these limitations by introducing novel aggregation methods and principles. One direction frames benchmarking in terms of microeconomics, highlighting the importance of user utility (Ethayarajh and Jurafsky, 2020). Other studies urge the evaluation of technical system properties in real-world scenarios (Zhou et al., 2021; Ma et al., 2021) and of the reliability of system rankings (Rodriguez et al., 2021). The benchmarking paradigm is also shifting towards adopting evaluation principles from other fields, such as non-parametric statistics and social choice theory (Choudhury and Deshpande, 2021; Min et al., 2021; Varshney et al., 2022; Colombo et al.).
Contributions.
Drawing inspiration from social choice theory, we make two application-oriented contributions and introduce an alternative tool for benchmark evaluation. First, this paper proposes VOTE’N’RANK, a flexible framework for ranking systems in multi-task/multi-criteria benchmarks and aggregating their performances based on end-user preferences. VOTE’N’RANK includes eight aggregation procedures that rely on the rankings in each criterion and allow aggregating both homogeneous and heterogeneous information. The framework is easy to use and allows users to plug in their own data. Second, we analyse the framework’s application in four case studies: (i) re-ranking three NLP and multimodal benchmarks; (ii) exploring under which circumstances a system becomes a Condorcet winner; (iii) evaluating robustness to omitted task scores; and (iv) ranking systems in accordance with user preferences.
We publicly release the VOTE’N’RANK framework² to foster further development of reliable and interpretable benchmark evaluation practices for both the academic and industrial communities.

²github.com/PragmaticsLab/vote_and_rank
2 VOTE’N’RANK
2.1 Background
The study of how individual preferences can be combined to reach a collective decision is the focus of social choice theory (Arrow, 2012). There are two main approaches to dealing with preferences: utilitarian and ordinal. The first relies on the so-called cardinal utility, which implies that there exists some unique utility function for each individual that defines their preferences. Here, we can work with utilities as numerical values, and collective decision making aims to maximise the social welfare. Examples of such utilities are the utilitarian and egalitarian social welfare measures, where the sum of the utilities of the individual agents and the utility of the worst-off agent get maximised, respectively.
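As a minimal sketch of the two welfare measures defined above (the system names and per-agent utility values are purely illustrative, not from the paper):

```python
# Hypothetical per-agent utilities for two systems.
utilities = {
    "system_a": [0.9, 0.2, 0.8],
    "system_b": [0.6, 0.6, 0.6],
}

def utilitarian_welfare(agent_utilities):
    # Sum of the individual agents' utilities.
    return sum(agent_utilities)

def egalitarian_welfare(agent_utilities):
    # Utility of the worst-off agent.
    return min(agent_utilities)

for name, us in utilities.items():
    print(name, utilitarian_welfare(us), egalitarian_welfare(us))
```

Note that the two measures can disagree: in this toy data, system_a attains the higher utilitarian welfare, while system_b is preferred under the egalitarian measure.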
The utilitarian approach has its drawbacks. First, it implies that such a utility exists, which is not always true: individuals can compare two systems and prefer one to the other but cannot say how many “utils” they got. Second, it assumes that individual utilities can be compared. The latter is a strong requirement for benchmarking problems, e.g. when we need to aggregate heterogeneous criteria such as performance and computational efficiency. In order to sum them up, one needs a transformation function that puts the metrics in the same measurement scheme. For example, DYNASCORE (Ma et al., 2021) utilises the Marginal Rate of Substitution (MRS) from economics as such a transformation function. Third, the utilitarian compensatory principle is questionable: can low performance in one task/criterion be compensated by high performance in the others? (Munda, 2012)
The ordinal approach has a weaker requirement: individuals have preferences (x is preferred to y, x ≻ y, i.e. binary relations over objects), which are aggregated into a social preference (also called a social ranking). This approach allows us to aggregate rankings from different tasks and criteria without worrying about measurement schemes.
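As a concrete example of an ordinal rule, consider the classical Borda count applied to benchmark scores (shown here only as an illustration of the ordinal approach; the scores and system names are invented, and the framework's own procedures are defined later). Each task ranks the systems by score, a system earns one point per system it beats within a task, and the points are summed across tasks:

```python
# Illustrative benchmark: task -> {system: score}.
scores = {
    "task1": {"A": 0.91, "B": 0.88, "C": 0.70},
    "task2": {"A": 0.40, "B": 0.55, "C": 0.60},
    "task3": {"A": 0.75, "B": 0.80, "C": 0.75},
}

def borda(scores):
    systems = list(next(iter(scores.values())))
    totals = {s: 0 for s in systems}
    for task_scores in scores.values():
        # Within each task, a system gets one point per system it strictly beats.
        for s in systems:
            totals[s] += sum(task_scores[s] > task_scores[o]
                             for o in systems if o != s)
    # Social ranking: systems sorted by total Borda points, best first.
    return sorted(totals.items(), key=lambda kv: -kv[1])
```

Only the within-task orderings matter here, so accuracy, F1, and efficiency scores can contribute on an equal footing without any rescaling.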
2.2 Aggregation Procedures
Definitions.
We adapt the conceptual definitions from social choice theory to the objectives of selecting the best-performing system and ranking a set of systems as follows: (i) a voter, or a criterion, is a task in a given benchmark, and (ii) an alternative is a candidate system.
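Under this mapping, a Condorcet winner (mentioned in the case studies above) is a system that beats every other system on a strict majority of tasks. A minimal sketch of the check, on hypothetical benchmark data:

```python
# Hypothetical benchmark: each task acts as a voter over the systems.
scores = {
    "task1": {"A": 0.90, "B": 0.80, "C": 0.70},
    "task2": {"A": 0.60, "B": 0.70, "C": 0.50},
    "task3": {"A": 0.80, "B": 0.60, "C": 0.90},
}

def pairwise_wins(scores, x, y):
    # Number of tasks (voters) on which system x outscores system y.
    return sum(task[x] > task[y] for task in scores.values())

def condorcet_winner(scores):
    systems = list(next(iter(scores.values())))
    for x in systems:
        if all(pairwise_wins(scores, x, y) > pairwise_wins(scores, y, x)
               for y in systems if y != x):
            return x
    return None  # no Condorcet winner (e.g. a preference cycle among systems)
```

Here system A wins a majority of tasks against both B and C, so it is the Condorcet winner; with cyclic pairwise majorities, no such system exists and the function returns None.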