VOTE'N'RANK: Revision of Benchmarking with Social Choice Theory

Mark Rofin1, Vladislav Mikhailov1,5, Mikhail Florinskiy1, Andrey Kravchenko2, Elena Tutubalina1,3, Tatiana Shavrina3,4,5, Daniel Karabekyan1, Ekaterina Artemova6

1HSE University, 2University of Oxford, 3Artificial Intelligence Research Institute, 4Institute of Linguistics, RAS, 5SaluteDevices, 6Center for Information and Language Processing, LMU Munich

Correspondence: Ekaterina.Artemova@lmu.de
Abstract
The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow unspoken utilitarian principles, where the systems are ranked based on their mean average score over task-specific metrics. Such an aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes VOTE'N'RANK, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. The VOTE'N'RANK procedures are more robust than the mean average whilst being able to handle missing performance scores and specify conditions under which a system becomes the winner.
1 Introduction
Benchmarking has evolved as a conventional practice for accelerating the development of generalisable systems in different applied areas of machine learning (ML). Benchmarks are typically designed as a collection of datasets, corresponding task-specific evaluation metrics, and a criterion for summarising the overall performance on the tasks (Ruder, 2021). The benchmark holders provide public leaderboards, which are utilised by ML researchers and practitioners for comparing
novel systems against one another and, if applicable, human baselines, as well as for selecting the best-performing ones for practical purposes. According to the benchmark sharing platform PAPERSWITHCODE1, the community has put much effort into creating more than 10,000 influential benchmarks in natural language processing (NLP), computer vision, and knowledge graphs, to name a few.

1URL: paperswithcode.com/sota. Access date: February 6, 2023.
Criticism of the benchmark pillars.
The benchmark methodological foundations have received wide criticism from academic and industrial communities (Bowman and Dahl, 2021). The criticism covers various aspects of benchmarking, raising concerns about the construct validity (Raji et al., 2021), fragility of the design and task choices (Dehghani et al., 2021), data leakage and annotation artifacts (Elangovan et al., 2021), SoTA-chasing tendencies at the cost of large carbon footprints (Bender et al., 2021), and low reproducibility of the reported results (Belz et al., 2021), inter alia. Recommendations proposed in these studies are of utmost importance to benchmark holders, system users, and developers. However, little attention has been paid to a more nuanced methodological question: how to aggregate performance scores in multi-task benchmarks?
Limits of canonical aggregation.
The appropriateness of mean aggregation in multi-task ML problems is an ongoing debate in the community. The mean aggregation procedure implies that all task metrics are homogeneous (Colombo et al., 2022). Otherwise, it is recommended to evaluate the statistical significance of differences between models with non-parametric tests (Demšar, 2006; Benavoli et al., 2016). In practice, the NLP GLUE-style benchmarks (Wang et al., 2018, 2019a, 2021; Liang et al., 2020) use the arithmetic average to rank models over heterogeneous metrics, which may lead to biased evaluation and subjective outcomes (Nießl et al., 2022; Waseem et al., 2021). The top-leading systems may dominate the others only on the outlier tasks (Agarwal et al., 2021), and their ranking is inconsistent with other Pythagorean means (Shavrina and Malykh, 2021). At the same time, the mean aggregation ignores the relative ordering and relies on the absolute score difference (Peyrard et al., 2017), equally treating tasks of different complexity (Mishra and Arunkumar, 2021) and from different domains (Webb, 2000).
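To make the inconsistency between Pythagorean means concrete, here is a toy numeric sketch with hypothetical scores for two systems on two tasks: the arithmetic mean prefers one system, while the geometric mean prefers the other.

```python
# Toy illustration (hypothetical scores): the leaderboard winner can change
# depending on which Pythagorean mean is used for aggregation.
from statistics import mean, geometric_mean

scores = {"system_A": [90.0, 50.0], "system_B": [70.0, 68.0]}

for name, s in scores.items():
    print(f"{name}: arithmetic={mean(s):.2f}, geometric={geometric_mean(s):.2f}")
# system_A: arithmetic=70.00, geometric=67.08  -> wins under the arithmetic mean
# system_B: arithmetic=69.00, geometric=68.99  -> wins under the geometric mean
```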
Novel aggregation principles.
Recent research has addressed these limitations, introducing novel aggregation methods and principles. One of the directions frames benchmarking in terms of microeconomics, highlighting the importance of the user utility (Ethayarajh and Jurafsky, 2020). Other studies urge evaluation of technical system properties in real-world scenarios (Zhou et al., 2021; Ma et al., 2021) and reliability of system rankings (Rodriguez et al., 2021). The benchmarking paradigm is also shifting towards adopting evaluation principles from other fields, such as non-parametric statistics and social choice theory (Choudhury and Deshpande, 2021; Min et al., 2021; Varshney et al., 2022; Colombo et al.).
Contributions.
Drawing inspiration from social choice theory, we make two application-oriented contributions and introduce an alternative tool for benchmark evaluation. First, this paper proposes VOTE'N'RANK, a flexible framework to rank systems in multi-task/multi-criteria benchmarks and aggregate the performances based on end-user preferences. VOTE'N'RANK includes 8 aggregation procedures that rely on rankings in each criterion and allow aggregating homogeneous and heterogeneous information. The framework is easy to use and allows the users to plug in their own data. Second, we analyse the framework's application in four case studies: (i) re-ranking three NLP and multimodal benchmarks; (ii) exploring under which circumstances a system becomes a Condorcet winner; (iii) evaluating robustness to omitted task scores; and (iv) ranking systems in accordance with user preferences.

We publicly release the VOTE'N'RANK framework2 to foster further development of reliable and interpretable benchmark evaluation practices for both academic and industrial communities.

2github.com/PragmaticsLab/vote_and_rank
2 VOTE'N'RANK
2.1 Background
The study of how individual preferences can be combined to reach a collective decision is the focus of social choice theory (Arrow, 2012). There are two main approaches to dealing with preferences: utilitarian and ordinal. The first approach relies on the so-called cardinal utility, which implies that there exists some unique utility function for each individual that defines their preferences. Here, we can work with utilities as numerical values, and collective decision making aims to maximise the social welfare utility. Examples of such utilities are the utilitarian and egalitarian social welfare measures, where the sum of utilities of individual agents and the utility of the worst agent get maximised, respectively.
The utilitarian approach has its drawbacks. First, it implies that such a utility exists, which is not always true: individuals can compare two systems and prefer one to another but cannot say how many "utils" they got. Second, it assumes that individual utilities can be compared. The latter is a strong requirement for benchmarking problems, e.g. when we need to aggregate heterogeneous criteria such as performance and computational efficiency. In order to sum them up, one needs a transformation function that puts the metrics in the same measurement scheme. For example, DYNASCORE (Ma et al., 2021) utilises the Marginal Rate of Substitution (MRS) from economics as such a transformation function. Third, the utilitarian compensatory principle is questionable: can low performance in one task/criterion be compensated by high performance in the others? (Munda, 2012)
The ordinal approach has a weaker requirement: individuals have preferences ($x$ is preferred to $y$, $x \succ y$, i.e. binary relations over objects), which should be aggregated into a social preference (also called a social ranking). This approach allows us to aggregate rankings from different tasks and criteria without worrying about measurement schemes.
2.2 Aggregation Procedures
Definitions.
We adapt the conceptual definitions of social choice theory to the objectives of selecting the best-performing system and ranking a set of systems as follows: (i) a voter, or a criterion, is a task in a given benchmark, and (ii) an alternative is a system candidate.
Objectives.
Suppose we have a set $M$ of systems $m \in \{m_1, \ldots, m_{|M|}\}$ from the benchmark, including a set $T$ of voters $t \in \{t_1, \ldots, t_{|T|}\}$ and the corresponding criteria $S = \{s_{mt}\}_{m=1,\,t=1}^{m=|M|,\,t=|T|}$, where $s_{mt}$ is the score of system $m$ in task $t$. Given that, we pursue two main objectives of the aggregation procedure $\sigma$, $\sigma: S \mapsto (M, \succ_\sigma)$: (i) to select the best-performing alternative $m$, so that there is no alternative $\hat{m}$ with $\hat{m} \succ_\sigma m$, and (ii) to rank the alternatives in descending order according to $\succ_\sigma$, so that $m_i \succ_\sigma m_j$ for $i < j$. Here $\succ_\sigma$ denotes the preference resulting from the aggregation procedure $\sigma$.
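The sketch below (hypothetical scores, illustrative system and task names) shows a score matrix $S$ and the per-criterion rankings that the procedures in this section operate on.

```python
# Hypothetical score matrix S: rows are systems m, columns are tasks/voters t.
import numpy as np

systems = ["m1", "m2", "m3"]
tasks = ["t1", "t2", "t3", "t4"]
S = np.array([
    [0.81, 0.64, 0.92, 0.55],  # m1
    [0.79, 0.71, 0.88, 0.60],  # m2
    [0.83, 0.58, 0.90, 0.52],  # m3
])

# Per-task rankings: for every task, the place of each system (0 = best).
# Ties are broken arbitrarily by argsort in this sketch.
order = np.argsort(-S, axis=0)         # system indices sorted best-first per task
positions = np.argsort(order, axis=0)  # place of each system per task
for i, m in enumerate(systems):
    print(m, dict(zip(tasks, positions[i].tolist())))
```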
Procedures.
We propose 8 rules from 3 different classes: scoring rules, iterative scoring rules, and majority-relation based rules. We provide more details and examples in Appendix A.1.
2.2.1 Scoring rules
The total score of each system is calculated as the sum of the corresponding scores in each task, $Sc(m) = \sum_{i=1}^{|M|} c_i p_i(m)$, where $p_i(m)$ is the number of tasks having model $m$ in the $i$-th place, and $c_i$ is the $i$-th element of the scoring vector $c$. The systems with the highest scores constitute the final decision. We study the following rules that differ in their scoring vectors.

- Plurality rule applies $c = (1, 0, \ldots, 0)$.
- Borda rule operates on $c = (|M| - 1, |M| - 2, \ldots, 1, 0)$.
- Dowdall rule applies the scoring vector $c = (1, 1/2, \ldots, 1/|M|)$.

The scoring vectors are designed to satisfy the voting rules' properties mentioned in Table 8 in Appendix A.2. The scoring vectors' design is based on the mathematical foundations of social choice theory and is generally accepted in the community (Aizerman and Aleskerov, 1995).
Interpretation.
The Plurality rule is one of the most widely used in everyday life. It only requires information about the best alternative for each voter. The Borda rule takes into account information about all alternatives. It assumes that differences in positions should be treated the same, whether between the first and the second alternatives or the second and the third ones. At the same time, the Dowdall rule is in some way in-between Plurality and Borda. It considers information about all alternatives but gives more weight to the difference in the preferences. A similar approach is used in the Eurovision song contest: it uses $c = (12, 10, 8, 7, \ldots, 1)$, making the difference in top positions more important to the outcome.
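A minimal sketch of the three scoring rules, assuming that per-task places are already computed (0 = best) and that there are no ties; this is not the released framework's API, only an illustration of $Sc(m) = \sum_i c_i p_i(m)$ on a hypothetical profile.

```python
import numpy as np

systems = ["m1", "m2", "m3", "m4"]
# positions[i, j]: place of system i in task j (0 = first place), 5 tasks
positions = np.array([
    [0, 1, 2, 0, 3],
    [1, 0, 0, 2, 2],
    [2, 3, 1, 1, 0],
    [3, 2, 3, 3, 1],
])
n = len(systems)

def scoring_rule(positions, c):
    """Total score Sc(m): each occupied place contributes its scoring-vector weight."""
    return np.array([sum(c[p] for p in row) for row in positions])

plurality = np.array([1.0] + [0.0] * (n - 1))   # c = (1, 0, ..., 0)
borda = np.arange(n - 1, -1, -1, dtype=float)   # c = (|M|-1, ..., 1, 0)
dowdall = 1.0 / np.arange(1, n + 1)             # c = (1, 1/2, ..., 1/|M|)

for name, c in [("Plurality", plurality), ("Borda", borda), ("Dowdall", dowdall)]:
    sc = scoring_rule(positions, c)
    winner = systems[int(sc.argmax())]
    print(f"{name}: {dict(zip(systems, sc.round(2)))}, winner: {winner}")
```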
2.2.2 Iterative scoring rules
- The Threshold rule applies $c = (1, 1, \ldots, 1, 1, 0)$. In case of ties, the scoring vectors $(1, 1, \ldots, 1, 0, 0)$, ..., $(1, 0, \ldots, 0, 0)$ are iteratively applied and used only to compare the systems with the maximum sum of scores.
- The Baldwin rule iteratively applies the scoring vectors $(|M| - 1, |M| - 2, \ldots, 1, 0)$, $(|M| - 2, |M| - 3, \ldots, 1, 0, 0)$, ..., $(1, 0, \ldots, 0, 0)$, and at each iteration discards the systems with the minimum sum of scores.

Both rules stop the procedure when it is impossible to break ties or there is only one alternative left.
Interpretation.
The rules are similar in their iterative nature but differ in the intuition behind them. The Threshold rule is based on the idea that the worst position is what matters most. When we start with $c = (1, 1, \ldots, 1, 1, 0)$, we choose the alternatives declared worst in the fewest cases. Since there can be ties, additional iterations are used to break them with $c = (1, 1, \ldots, 1, 0, 0)$ and so on; in other words, by looking at the last $k$ positions until we have one alternative left or cannot break ties.

The Baldwin rule has two main differences from Threshold. First, it is based on the Borda score and considers information from all positions in the ranking, not only the worst one. Second, whilst the Threshold rule applies a new vector to the original profile and compares only tied alternatives, the Baldwin rule iteratively eliminates the least-scored systems and moves the remaining ones up in the rankings. For example, if system $m_A$ is in the fifth place, but the alternatives from the first four places are eliminated in the first rounds, $m_A$ will be first until it is eliminated or is among the alternatives in the outcome.
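Below is a rough sketch of the Baldwin rule only (the Threshold rule follows the same iterative pattern with its own vectors), under two simplifying assumptions spelled out in the comments; the scores are hypothetical and this is not the released framework's implementation.

```python
# Baldwin rule sketch. Assumptions: all systems tied at the minimum Borda score
# are discarded together, and the procedure stops when every remaining system
# is tied (i.e. the tie cannot be broken).
def baldwin(scores):
    """scores: {system: [per-task score, ...]} with the same number of tasks per system."""
    remaining = list(scores)
    n_tasks = len(next(iter(scores.values())))
    while len(remaining) > 1:
        borda = {m: 0 for m in remaining}
        for t in range(n_tasks):
            ranked = sorted(remaining, key=lambda m: scores[m][t], reverse=True)
            for place, m in enumerate(ranked):
                borda[m] += len(remaining) - 1 - place  # Borda points on this task
        worst = min(borda.values())
        if all(v == worst for v in borda.values()):
            break                                       # cannot break the tie
        remaining = [m for m in remaining if borda[m] > worst]
    return remaining

scores = {"m1": [0.81, 0.64, 0.92], "m2": [0.79, 0.71, 0.88], "m3": [0.83, 0.58, 0.90]}
print(baldwin(scores))  # ['m1'] for these hypothetical scores
```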
2.2.3 Majority-relation based rules
Let us define a majority relation $\mu$ over the set of alternatives as the following binary relation: $m_A \mu m_B$ iff $m_A$ is ranked higher than $m_B$ by more criteria.

- Condorcet rule. $m_C$ is the Condorcet winner (CW) iff $m_C \mu m$ for any $m \in M$.
- Copeland rule. Define the lower counter set of a system $m_A$ as the set of systems dominated by $m_A$ via $\mu$: $L(m_A) = \{m \in M : m_A \mu m\}$. In a similar way, define the upper counter set of a system $m_A$ as the set of systems that dominate $m_A$ via $\mu$: $U(m_A) = \{m \in M : m \mu m_A\}$. Define $u(m) = |L(m)| - |U(m)|$. The final decision is provided by the alternatives with the highest $u(m)$.
- Minimax rule. Let $s(m_A, m_B)$ be the number of criteria for which system $m_A$ is ranked higher than system $m_B$ if $m_A \mu m_B$, and $s(m_A, m_B) = 0$ otherwise. The systems are ranked according to the formula $\mathrm{rank}(m_A) = \max_B s(m_B, m_A)$.
Interpretation.
The CW is the alternative that beats all the others in pairwise comparisons. However, the Condorcet rule does not declare any winner if a CW does not exist. The Copeland and Minimax rules select the CW whenever it exists and resolve this drawback as follows. The Copeland rule selects the alternative that dominates more alternatives and is dominated by fewer (the difference between the two numbers is maximised). The Minimax rule chooses the alternative whose heaviest pairwise defeat is supported by the fewest criteria.
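A minimal sketch of the majority relation $\mu$ and the three rules on a hypothetical ranking profile with no Condorcet winner (one pairwise comparison is tied), showing how Copeland and Minimax still produce a decision; this is not the framework's actual API.

```python
import numpy as np

systems = ["m1", "m2", "m3"]
positions = np.array([   # positions[i, j]: place of system i in task j (0 = best)
    [0, 2, 1, 0],
    [1, 0, 2, 1],
    [2, 1, 0, 2],
])
n = len(systems)

def beats(a, b):
    """Majority relation mu: a beats b iff more criteria rank a above b."""
    return (positions[a] < positions[b]).sum() > (positions[b] < positions[a]).sum()

def s(a, b):
    """Number of criteria ranking a above b if a mu b, else 0 (Minimax statistic)."""
    return int((positions[a] < positions[b]).sum()) if beats(a, b) else 0

# Condorcet: a winner must beat every other alternative.
cw = [systems[a] for a in range(n) if all(beats(a, b) for b in range(n) if b != a)]
# Copeland: u(m) = |L(m)| - |U(m)|, maximised.
copeland = {systems[a]: sum(beats(a, b) for b in range(n) if b != a)
                        - sum(beats(b, a) for b in range(n) if b != a)
            for a in range(n)}
# Minimax: rank(m_A) = max_B s(m_B, m_A), minimised (smallest worst defeat).
minimax = {systems[a]: max(s(b, a) for b in range(n) if b != a) for a in range(n)}

print("Condorcet winner:", cw or "none")   # none here: m1 vs m3 is tied 2-2
print("Copeland u(m):", copeland)          # m1 maximises u(m)
print("Minimax worst defeat:", minimax)    # m1 minimises the worst defeat
```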
2.3 Properties of the Aggregation Procedures
There is a multitude of voting rules in social choice theory (Nurmi, 1983; Levin and Nalebuff, 1995; De Almeida et al., 2019; Aleskerov et al., 2010). The motivation behind our rules3 is that they generally overcome the mean aggregation limitations and vary in their properties, allowing the user to be more flexible in choosing the rule for their purposes. The outcomes can be interpreted in terms of the properties followed or violated by the rules. We discuss our rules' properties in Appendix A.2.

3We do not consider more complex rules like Kemeny, since it is NP-hard to find the Kemeny winner (Bartholdi et al., 1989), and it is often implemented as the Borda rule approximation (Colombo et al.).
2.4 Framework
Figure 1 describes three supported settings for performing the aggregation objectives. The toy benchmark has three evaluated alternatives and consists of seven voters grouped by task type, e.g. natural language inference, text classification, and question answering (QA).
A. Basic aggregation: the aggregation procedure is applied to the leaderboard as is.

B. Weighted aggregation: each voter in a group is assigned a group weight equal to $1/|T_{group}|$. The blue group weights are $1/3$, and the orange and violet group weights are $1/2$. Each group contributes equally to the final ranking, regardless of the number of voters.

C. Two-step aggregation: each voter group is treated as a standalone leaderboard. We independently apply a procedure to each voter group and compute an interim ranking, shown as an "elector". Next, we aggregate the group-wise rankings by applying the same procedure one more time and compute the final ranking (a minimal sketch of settings B and C is given below, after the figure caption).

[Figure 1: Three ways to run the aggregation procedures. A: Basic aggregation. B: Weighted aggregation. C: Two-step aggregation.]
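A rough sketch of settings B and C, using the Borda rule as the base procedure; the task groups, scores, and function names below are hypothetical, and the released framework's API may differ.

```python
import numpy as np

groups = {                      # voters grouped by task type (hypothetical)
    "nli": ["t1", "t2", "t3"],
    "classification": ["t4", "t5"],
    "qa": ["t6", "t7"],
}
systems = ["A", "B", "C"]
scores = {                      # scores[task][i]: score of systems[i] on that task
    "t1": [0.8, 0.7, 0.6], "t2": [0.9, 0.5, 0.7], "t3": [0.6, 0.8, 0.7],
    "t4": [0.5, 0.9, 0.8], "t5": [0.7, 0.6, 0.9],
    "t6": [0.8, 0.6, 0.9], "t7": [0.7, 0.9, 0.5],
}

def borda(task_scores, weights=None):
    """Weighted Borda: every task adds (|M| - 1 - place) times its weight."""
    n = len(systems)
    total = np.zeros(n)
    for task, s in task_scores.items():
        w = 1.0 if weights is None else weights[task]
        order = np.argsort(-np.asarray(s))   # system indices, best first
        for place, i in enumerate(order):
            total[i] += w * (n - 1 - place)
    return total

# Setting B: weighted aggregation, each voter weighted by 1 / |T_group|.
weights = {t: 1.0 / len(ts) for ts in groups.values() for t in ts}
print("Weighted Borda:", dict(zip(systems, borda(scores, weights).round(2))))

# Setting C: two-step aggregation, Borda within each group ("electors"),
# then Borda once more over the group-wise results.
electors = {g: borda({t: scores[t] for t in ts}).tolist() for g, ts in groups.items()}
print("Two-step Borda:", dict(zip(systems, borda(electors))))
```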
3 Case Studies
This section describes four case studies on three NLP and multimodal benchmarks. Our main objective here is to re-interpret the benchmarking trends under social choice theory. We provide a brief description of the benchmarks below.

- GLUE (General Language Understanding Evaluation; Wang et al., 2018) combines nine datasets on QA, sentiment analysis, and textual entailment. GLUE also includes a linguistic diagnostic test set. $|M| = 30$.
- SGLUE (Wang et al., 2019a) is the GLUE follow-up consisting of two diagnostic and eight more complex NLU tasks, ranging from causal reasoning to multi-hop and cloze-style QA. $|M| = 22$.
- VALUE (Video-and-Language Understanding Evaluation; Li et al.) covers 11 video-and-language datasets on text-to-video retrieval, video QA, and video captioning. $|M| = 7$.

The leaderboards present the results of evaluating various neural models, such as BERT (Devlin