VOTE'N'RANK: Revision of Benchmarking with Social Choice Theory

Mark Rofin1, Vladislav Mikhailov1,5, Mikhail Florinskiy1, Andrey Kravchenko2, Elena Tutubalina1,3, Tatiana Shavrina3,4,5, Daniel Karabekyan1, Ekaterina Artemova6

1HSE University, 2University of Oxford, 3Artificial Intelligence Research Institute, 4Institute of Linguistics, RAS, 5SaluteDevices, 6Center for Information and Language Processing, LMU Munich

Correspondence: Ekaterina.Artemova@lmu.de
Abstract
The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow unspoken utilitarian principles, where the systems are ranked based on their mean average score over task-specific metrics. Such an aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes VOTE'N'RANK, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. The VOTE'N'RANK procedures are more robust than the mean average whilst being able to handle missing performance scores and specify conditions under which a system becomes the winner.
1 Introduction
Benchmarking has evolved as a conventional practice for accelerating the development of generalisable systems in different applied areas of machine learning (ML). Benchmarks are typically designed as a collection of datasets, corresponding task-specific evaluation metrics, and a criterion for summarising the overall performance on the tasks (Ruder, 2021). The benchmark holders provide public leaderboards, which are utilised by ML researchers and practitioners for comparing
novel systems against one another and, if applicable, human baselines, as well as for selecting the best-performing ones for practical purposes. According to the benchmark sharing platform PAPERSWITHCODE1, the community has put much effort into creating more than 10,000 influential benchmarks in natural language processing (NLP), computer vision, and knowledge graphs, to name a few.

1URL: paperswithcode.com/sota. Access date: February 6, 2023.
Criticism of the benchmark pillars.
The benchmark methodological foundations have received wide criticism from academic and industrial communities (Bowman and Dahl, 2021). The criticism covers various aspects of benchmarking, raising concerns about the construct validity (Raji et al., 2021), fragility of the design and task choices (Dehghani et al., 2021), data leakage and annotation artifacts (Elangovan et al., 2021), SoTA-chasing tendencies at the cost of large carbon footprints (Bender et al., 2021), and low reproducibility of the reported results (Belz et al., 2021), inter alia. Recommendations proposed in these studies are of utmost importance to benchmark holders, system users, and developers. However, little attention has been paid to a more nuanced methodological question: how to aggregate performance scores in multi-task benchmarks?
Limits of canonical aggregation.
The appropriateness of mean aggregation in multi-task ML problems is an ongoing debate in the community. The mean aggregation procedure implies that all task metrics are homogeneous (Colombo et al., 2022). Otherwise, it is recommended to evaluate the statistical significance of differences between models with non-parametric tests (Demšar, 2006; Benavoli et al., 2016). In practice, the NLP GLUE-style benchmarks (Wang et al., 2018, 2019a, 2021; Liang et al., 2020) use the arithmetic average to rank models over heterogeneous metrics, which may lead to biased evaluation and subjective outcomes (Nießl et al., 2022; Waseem et al., 2021). The top-leading systems may dominate the others only on the outlier tasks (Agarwal et al., 2021), and their ranking is inconsistent with other Pythagorean means (Shavrina and Malykh, 2021). At the same time, the mean aggregation ignores the relative ordering and relies on the absolute score difference (Peyrard et al., 2017), equally treating tasks of different complexity (Mishra and Arunkumar, 2021) and from different domains (Webb, 2000).
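To make the inconsistency between Pythagorean means concrete, here is a toy numeric sketch with hypothetical scores for two systems on two tasks: the arithmetic mean prefers one system, while the geometric mean prefers the other.

```python
# Toy illustration (hypothetical scores): the leaderboard winner can change
# depending on which Pythagorean mean is used for aggregation.
from statistics import mean, geometric_mean

scores = {"system_A": [90.0, 50.0], "system_B": [70.0, 68.0]}

for name, s in scores.items():
    print(f"{name}: arithmetic={mean(s):.2f}, geometric={geometric_mean(s):.2f}")
# system_A: arithmetic=70.00, geometric=67.08  -> wins under the arithmetic mean
# system_B: arithmetic=69.00, geometric=68.99  -> wins under the geometric mean
```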
Novel aggregation principles.
Recent research has addressed these limitations, introducing novel aggregation methods and principles. One of the directions frames benchmarking in terms of microeconomics, highlighting the importance of the user utility (Ethayarajh and Jurafsky, 2020). Other studies urge evaluation of technical system properties in real-world scenarios (Zhou et al., 2021; Ma et al., 2021) and reliability of system rankings (Rodriguez et al., 2021). The benchmarking paradigm is also shifting towards adopting evaluation principles from other fields, such as non-parametric statistics and social choice theory (Choudhury and Deshpande, 2021; Min et al., 2021; Varshney et al., 2022; Colombo et al.).
Contributions.
Drawing inspiration from social choice theory, we make two application-oriented contributions and introduce an alternative tool for benchmark evaluation. First, this paper proposes VOTE'N'RANK, a flexible framework to rank systems in multi-task/multi-criteria benchmarks and aggregate the performances based on end-user preferences. VOTE'N'RANK includes 8 aggregation procedures that rely on rankings in each criterion and allow aggregating homogeneous and heterogeneous information. The framework is easy to use and allows the users to plug in their own data. Second, we analyse the framework's application in four case studies: (i) re-ranking three NLP and multimodal benchmarks; (ii) exploring under which circumstances a system becomes a Condorcet winner; (iii) evaluating robustness to omitted task scores; and (iv) ranking systems in accordance with user preferences.

We publicly release the VOTE'N'RANK framework2 to foster further development of reliable and interpretable benchmark evaluation practices for both academic and industrial communities.

2github.com/PragmaticsLab/vote_and_rank
2 VOTE'N'RANK
2.1 Background
The study of how individual preferences can be combined to reach a collective decision is the focus of social choice theory (Arrow, 2012). There are two main approaches to dealing with preferences: utilitarian and ordinal. The first approach relies on the so-called cardinal utility, which implies that there exists some unique utility function for each individual that defines their preferences. Here, we can work with utilities as numerical values, and collective decision making aims to maximise the social welfare utility. Examples of such utilities are the utilitarian and egalitarian social welfare measures, where the sum of utilities of individual agents and the utility of the worst agent get maximised, respectively.
The utilitarian approach has its drawbacks. First, it implies that such a utility exists, which is not always true: individuals can compare two systems and prefer one to another but cannot say how many "utils" they got. Second, it assumes that individual utilities can be compared. The latter is a strong requirement for benchmarking problems, e.g. when we need to aggregate heterogeneous criteria such as performance and computational efficiency. In order to sum them up, one needs a transformation function that puts the metrics in the same measurement scheme. For example, DYNASCORE (Ma et al., 2021) utilises the Marginal Rate of Substitution (MRS) from economics as such a transformation function. Third, the utilitarian compensatory principle is questionable: can low performance in one task/criterion be compensated by high performance in the others? (Munda, 2012)
The ordinal approach has a weaker requirement: individuals have preferences ($x$ is preferred to $y$, $x \succ y$, i.e. binary relations over objects), which should be aggregated into a social preference (also called a social ranking). This approach allows us to aggregate rankings from different tasks and criteria without worrying about measurement schemes.
2.2 Aggregation Procedures
Definitions.
We adapt the conceptual definitions of social choice theory to the objectives of selecting the best-performing system and ranking a set of systems as follows: (i) a voter, or a criterion, is a task in a given benchmark, and (ii) an alternative is a system candidate.
Objectives.
Suppose we have a set $M$ of systems $m \in \{m_1, \ldots, m_{|M|}\}$ from the benchmark, including a set $T$ of voters $t \in \{t_1, \ldots, t_{|T|}\}$ and the corresponding criteria $S = \{s_{mt}\}_{m=1,\,t=1}^{m=|M|,\,t=|T|}$, where $s_{mt}$ is the score of system $m$ in task $t$. Given that, we pursue two main objectives of the aggregation procedure $\sigma$, $\sigma: S \mapsto (M, \succ_\sigma)$: (i) to select the best-performing alternative $m$, so that there is no alternative $\hat{m}$ with $\hat{m} \succ_\sigma m$, and (ii) to rank the alternatives in descending order according to $\succ_\sigma$, so that $m_i \succ_\sigma m_j$ for $i < j$. Here $\succ_\sigma$ denotes the preference resulting from the aggregation procedure $\sigma$.
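The sketch below (hypothetical scores, illustrative system and task names) shows a score matrix $S$ and the per-criterion rankings that the procedures in this section operate on.

```python
# Hypothetical score matrix S: rows are systems m, columns are tasks/voters t.
import numpy as np

systems = ["m1", "m2", "m3"]
tasks = ["t1", "t2", "t3", "t4"]
S = np.array([
    [0.81, 0.64, 0.92, 0.55],  # m1
    [0.79, 0.71, 0.88, 0.60],  # m2
    [0.83, 0.58, 0.90, 0.52],  # m3
])

# Per-task rankings: for every task, the place of each system (0 = best).
# Ties are broken arbitrarily by argsort in this sketch.
order = np.argsort(-S, axis=0)         # system indices sorted best-first per task
positions = np.argsort(order, axis=0)  # place of each system per task
for i, m in enumerate(systems):
    print(m, dict(zip(tasks, positions[i].tolist())))
```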
Procedures.
We propose 8 rules from 3 different classes: scoring rules, iterative scoring rules, and majority-relation based rules. We provide more details and examples in Appendix A.1.
2.2.1 Scoring rules
The total score of each system is calculated as the sum of the corresponding scores in each task, $Sc(m) = \sum_{i=1}^{|M|} c_i p_i(m)$, where $p_i(m)$ is the number of tasks having model $m$ in the $i$-th place, and $c_i$ is the $i$-th element of the scoring vector $c$. The systems with the highest scores constitute the final decision. We study the following rules that differ in their scoring vectors.

- Plurality rule applies $c = (1, 0, \ldots, 0)$.
- Borda rule operates on $c = (|M| - 1, |M| - 2, \ldots, 1, 0)$.
- Dowdall rule applies the scoring vector $c = (1, 1/2, \ldots, 1/|M|)$.

The scoring vectors are designed to satisfy the voting rules' properties mentioned in Table 8 in Appendix A.2. The scoring vectors' design is based on the mathematical foundations of social choice theory and is generally accepted in the community (Aizerman and Aleskerov, 1995).
Interpretation.
The Plurality rule is one of the most widely used in everyday life. It only requires information about the best alternative for each voter. The Borda rule takes into account information about all alternatives. It assumes that differences in positions should be treated the same, whether between the first and the second alternatives or the second and the third ones. At the same time, the Dowdall rule is in some way in-between Plurality and Borda. It considers information about all alternatives but gives more weight to the difference in the preferences. A similar approach is used in the Eurovision song contest: it uses $c = (12, 10, 8, 7, \ldots, 1)$, making the difference in top positions more important to the outcome.
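A minimal sketch of the three scoring rules, assuming that per-task places are already computed (0 = best) and that there are no ties; this is not the released framework's API, only an illustration of $Sc(m) = \sum_i c_i p_i(m)$ on a hypothetical profile.

```python
import numpy as np

systems = ["m1", "m2", "m3", "m4"]
# positions[i, j]: place of system i in task j (0 = first place), 5 tasks
positions = np.array([
    [0, 1, 2, 0, 3],
    [1, 0, 0, 2, 2],
    [2, 3, 1, 1, 0],
    [3, 2, 3, 3, 1],
])
n = len(systems)

def scoring_rule(positions, c):
    """Total score Sc(m): each occupied place contributes its scoring-vector weight."""
    return np.array([sum(c[p] for p in row) for row in positions])

plurality = np.array([1.0] + [0.0] * (n - 1))   # c = (1, 0, ..., 0)
borda = np.arange(n - 1, -1, -1, dtype=float)   # c = (|M|-1, ..., 1, 0)
dowdall = 1.0 / np.arange(1, n + 1)             # c = (1, 1/2, ..., 1/|M|)

for name, c in [("Plurality", plurality), ("Borda", borda), ("Dowdall", dowdall)]:
    sc = scoring_rule(positions, c)
    winner = systems[int(sc.argmax())]
    print(f"{name}: {dict(zip(systems, sc.round(2)))}, winner: {winner}")
```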
2.2.2 Iterative scoring rules
- The Threshold rule applies $c = (1, 1, \ldots, 1, 1, 0)$. In case of ties, the scoring vectors $(1, 1, \ldots, 1, 0, 0)$, ..., $(1, 0, \ldots, 0, 0)$ are iteratively applied and used only to compare the systems with the maximum sum of scores.
- The Baldwin rule iteratively applies the scoring vectors $(|M| - 1, |M| - 2, \ldots, 1, 0)$, $(|M| - 2, |M| - 3, \ldots, 1, 0, 0)$, ..., $(1, 0, \ldots, 0, 0)$, and at each iteration discards the systems with the minimum sum of scores.

Both rules stop the procedure when it is impossible to break ties or there is only one alternative left.
Interpretation.
The rules are similar in their iterative nature but differ in the intuition behind them. The Threshold rule is based on the idea that the worst position is what matters most. When we start with $c = (1, 1, \ldots, 1, 1, 0)$, we choose the alternatives declared worst in the fewest cases. Since there can be ties, additional iterations are used to break them with $c = (1, 1, \ldots, 1, 0, 0)$ and so on; in other words, by looking at the last $k$ positions until we have one alternative left or cannot break ties.

The Baldwin rule has two main differences from Threshold. First, it is based on the Borda score and considers information from all positions in the ranking, not only the worst one. Second, whilst the Threshold rule applies a new vector to the original profile and compares only tied alternatives, the Baldwin rule iteratively eliminates the least-scored systems and moves the remaining ones up in the rankings. For example, if system $m_A$ is in the fifth place, but the alternatives from the first four places are eliminated in the first rounds, $m_A$ will be first until it is eliminated or is among the alternatives in the outcome.
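Below is a rough sketch of the Baldwin rule only (the Threshold rule follows the same iterative pattern with its own vectors), under two simplifying assumptions spelled out in the comments; the scores are hypothetical and this is not the released framework's implementation.

```python
# Baldwin rule sketch. Assumptions: all systems tied at the minimum Borda score
# are discarded together, and the procedure stops when every remaining system
# is tied (i.e. the tie cannot be broken).
def baldwin(scores):
    """scores: {system: [per-task score, ...]} with the same number of tasks per system."""
    remaining = list(scores)
    n_tasks = len(next(iter(scores.values())))
    while len(remaining) > 1:
        borda = {m: 0 for m in remaining}
        for t in range(n_tasks):
            ranked = sorted(remaining, key=lambda m: scores[m][t], reverse=True)
            for place, m in enumerate(ranked):
                borda[m] += len(remaining) - 1 - place  # Borda points on this task
        worst = min(borda.values())
        if all(v == worst for v in borda.values()):
            break                                       # cannot break the tie
        remaining = [m for m in remaining if borda[m] > worst]
    return remaining

scores = {"m1": [0.81, 0.64, 0.92], "m2": [0.79, 0.71, 0.88], "m3": [0.83, 0.58, 0.90]}
print(baldwin(scores))  # ['m1'] for these hypothetical scores
```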
2.2.3 Majority-relation based rules
Let us define a majority relation $\mu$ over the set of alternatives as the following binary relation: $m_A \mu m_B$ iff $m_A$ is ranked higher than $m_B$ by more criteria.

- Condorcet rule. $m_C$ is the Condorcet winner (CW) iff $m_C \mu m$ for any $m \in M$.
- Copeland rule. Define the lower counter set of a system $m_A$ as the set of systems dominated by $m_A$ via $\mu$: $L(m_A) = \{m \in M : m_A \mu m\}$. In a similar way, define the upper counter set of a system $m_A$ as the set of systems that dominate $m_A$ via $\mu$: $U(m_A) = \{m \in M : m \mu m_A\}$. Define $u(m) = |L(m)| - |U(m)|$. The final decision is provided by the alternatives with the highest $u(m)$.
- Minimax rule. Let $s(m_A, m_B)$ be the number of criteria for which system $m_A$ is ranked higher than system $m_B$ if $m_A \mu m_B$, and $s(m_A, m_B) = 0$ otherwise. The systems are ranked according to the formula $\mathrm{rank}(m_A) = \max_B s(m_B, m_A)$.
Interpretation.
The CW is the alternative that beats all the others in pairwise comparisons. However, the Condorcet rule does not declare any winner if a CW does not exist. The Copeland and Minimax rules select the CW whenever it exists and resolve this drawback as follows. The Copeland rule selects the alternative that dominates more alternatives and is dominated by fewer (the difference between the two numbers is maximised). The Minimax rule chooses the alternative whose heaviest pairwise defeat is supported by the fewest criteria.
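A minimal sketch of the majority relation $\mu$ and the three rules on a hypothetical ranking profile with no Condorcet winner (one pairwise comparison is tied), showing how Copeland and Minimax still produce a decision; this is not the framework's actual API.

```python
import numpy as np

systems = ["m1", "m2", "m3"]
positions = np.array([   # positions[i, j]: place of system i in task j (0 = best)
    [0, 2, 1, 0],
    [1, 0, 2, 1],
    [2, 1, 0, 2],
])
n = len(systems)

def beats(a, b):
    """Majority relation mu: a beats b iff more criteria rank a above b."""
    return (positions[a] < positions[b]).sum() > (positions[b] < positions[a]).sum()

def s(a, b):
    """Number of criteria ranking a above b if a mu b, else 0 (Minimax statistic)."""
    return int((positions[a] < positions[b]).sum()) if beats(a, b) else 0

# Condorcet: a winner must beat every other alternative.
cw = [systems[a] for a in range(n) if all(beats(a, b) for b in range(n) if b != a)]
# Copeland: u(m) = |L(m)| - |U(m)|, maximised.
copeland = {systems[a]: sum(beats(a, b) for b in range(n) if b != a)
                        - sum(beats(b, a) for b in range(n) if b != a)
            for a in range(n)}
# Minimax: rank(m_A) = max_B s(m_B, m_A), minimised (smallest worst defeat).
minimax = {systems[a]: max(s(b, a) for b in range(n) if b != a) for a in range(n)}

print("Condorcet winner:", cw or "none")   # none here: m1 vs m3 is tied 2-2
print("Copeland u(m):", copeland)          # m1 maximises u(m)
print("Minimax worst defeat:", minimax)    # m1 minimises the worst defeat
```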
2.3 Properties of the Aggregation Procedures
There is a multitude of voting rules in social choice theory (Nurmi, 1983; Levin and Nalebuff, 1995; De Almeida et al., 2019; Aleskerov et al., 2010). The motivation behind our rules3 is that they generally overcome the mean aggregation limitations and vary in their properties, allowing the user to be more flexible in choosing the rule for their purposes. The outcomes can be interpreted in terms of the properties followed or violated by the rules. We discuss our rules' properties in Appendix A.2.

3We do not consider more complex rules like Kemeny, since it is NP-hard to find the Kemeny winner (Bartholdi et al., 1989), and it is often implemented as the Borda rule approximation (Colombo et al.).
2.4 Framework
Figure 1 describes three supported settings for performing the aggregation objectives. The toy benchmark has three evaluated alternatives and consists of seven voters grouped by task type, e.g. natural language inference, text classification, and question answering (QA).
A. Basic aggregation: the aggregation procedure is applied to the leaderboard as is.

B. Weighted aggregation: each voter in a group is assigned a group weight equal to $1/|T_{group}|$. The blue group weights are $1/3$, and the orange and violet group weights are $1/2$. Each group contributes equally to the final ranking, regardless of the number of voters.

C. Two-step aggregation: each voter group is treated as a standalone leaderboard. We independently apply a procedure to each voter group and compute an interim ranking, shown as an "elector". Next, we aggregate the group-wise rankings by applying the same procedure one more time and compute the final ranking (a minimal sketch of settings B and C is given below, after the figure caption).

[Figure 1: Three ways to run the aggregation procedures. A: Basic aggregation. B: Weighted aggregation. C: Two-step aggregation.]
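A rough sketch of settings B and C, using the Borda rule as the base procedure; the task groups, scores, and function names below are hypothetical, and the released framework's API may differ.

```python
import numpy as np

groups = {                      # voters grouped by task type (hypothetical)
    "nli": ["t1", "t2", "t3"],
    "classification": ["t4", "t5"],
    "qa": ["t6", "t7"],
}
systems = ["A", "B", "C"]
scores = {                      # scores[task][i]: score of systems[i] on that task
    "t1": [0.8, 0.7, 0.6], "t2": [0.9, 0.5, 0.7], "t3": [0.6, 0.8, 0.7],
    "t4": [0.5, 0.9, 0.8], "t5": [0.7, 0.6, 0.9],
    "t6": [0.8, 0.6, 0.9], "t7": [0.7, 0.9, 0.5],
}

def borda(task_scores, weights=None):
    """Weighted Borda: every task adds (|M| - 1 - place) times its weight."""
    n = len(systems)
    total = np.zeros(n)
    for task, s in task_scores.items():
        w = 1.0 if weights is None else weights[task]
        order = np.argsort(-np.asarray(s))   # system indices, best first
        for place, i in enumerate(order):
            total[i] += w * (n - 1 - place)
    return total

# Setting B: weighted aggregation, each voter weighted by 1 / |T_group|.
weights = {t: 1.0 / len(ts) for ts in groups.values() for t in ts}
print("Weighted Borda:", dict(zip(systems, borda(scores, weights).round(2))))

# Setting C: two-step aggregation, Borda within each group ("electors"),
# then Borda once more over the group-wise results.
electors = {g: borda({t: scores[t] for t in ts}).tolist() for g, ts in groups.items()}
print("Two-step Borda:", dict(zip(systems, borda(electors))))
```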
3 Case Studies
This section describes four case studies on three NLP and multimodal benchmarks. Our main objective here is to re-interpret the benchmarking trends under social choice theory. We provide a brief description of the benchmarks below.

- GLUE (General Language Understanding Evaluation; Wang et al., 2018) combines nine datasets on QA, sentiment analysis, and textual entailment. GLUE also includes a linguistic diagnostic test set. $|M| = 30$.
- SGLUE (Wang et al., 2019a) is the GLUE follow-up consisting of two diagnostic and eight more complex NLU tasks, ranging from causal reasoning to multi-hop and cloze-style QA. $|M| = 22$.
- VALUE (Video-and-Language Understanding Evaluation; Li et al.) covers 11 video-and-language datasets on text-to-video retrieval, video QA, and video captioning. $|M| = 7$.

The leaderboards present the results of evaluating various neural models, such as BERT (Devlin