
sitive to small (and often undocumented) choices such as random seeds and hyperparameters (Pineau et al., 2021). Model performance is often not compared with proper statistical testing that takes this variance into account, making many self-reported comparisons unreliable (see the sketch after these points for a variance-aware comparison). Our goal is to standardize this process and thereby improve the reproducibility of ML evaluations.
Centralization: Historically, ML metrics have been poorly documented, exacerbating an already insufficient community-wide understanding of their usage and shortcomings (Post, 2018). As metrics and datasets change, the onus is on the community to keep results up-to-date, causing unnecessary replication of work (Ma et al., 2021) and the proliferation of outdated artifacts (Luccioni et al., 2022).
Coverage: ML as a field still focuses heavily on accuracy-based metrics. While accuracy is important, this narrow focus glosses over other critical facets such as efficiency (Min et al., 2021), bias and fairness (Qian et al., 2022), robustness (Goel et al., 2021), and how these factor into choosing a model (Ethayarajh and Jurafsky, 2020; Ma et al., 2021).
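To make the statistical-testing point above concrete, the following sketch compares two models evaluated across several random seeds with a paired significance test. The per-seed scores and the 0.05 threshold are hypothetical placeholders, and a paired bootstrap or permutation test would serve equally well.

```python
# Sketch: variance-aware comparison of two models across random seeds.
# The per-seed scores below are hypothetical placeholders; in practice they
# come from re-training/evaluating each model under several seeds.
from scipy import stats

model_a_scores = [0.812, 0.807, 0.815, 0.809, 0.811]  # e.g. accuracy per seed
model_b_scores = [0.818, 0.810, 0.816, 0.814, 0.813]

# Paired t-test on per-seed score differences: is the mean difference
# significantly different from zero, given the observed seed variance?
t_stat, p_value = stats.ttest_rel(model_b_scores, model_a_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Only report an improvement if p falls below a pre-registered threshold
# (0.05 here, chosen purely for illustration).
alpha = 0.05
print("significant" if p_value < alpha else "not significant")
```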
We introduce the open-source Evaluate library and the Evaluation on the Hub platform to address many of these problems. We believe that better evaluation can happen if we, as a community, establish best practices and remove hurdles.
2 Related work
Open-Source Tools for Evaluation
There is a long history of open-source projects aiming to capture various measurements, metrics, and statistical testing methods for ML. Torchmetrics (Detlefsen et al., 2022) implements a large number of model evaluation metrics for PyTorch (Paszke et al., 2019), similar to the evaluation metrics found in Keras (Chollet et al., 2015) for TensorFlow. Libraries such as Scikit-learn (Pedregosa et al., 2011), SciPy (Virtanen et al., 2020), Statsmodels (Seabold and Perktold, 2010), NLTK (Bird et al., 2009), TrecTools (Palotti et al., 2019), RL Reliability Metrics (Chan et al., 2020), NetworkX (Hagberg et al., 2008), Scikit-image (Van der Walt et al., 2014), GEM (Gehrmann et al., 2021), and TorchFidelity (Obukhov et al., 2020) also support many evaluation measures across many domains. As integrating metrics into specific frameworks can be difficult, there are also many libraries dedicated to individual evaluations, for example rouge_score,¹ BARTScore (Yuan et al., 2021), or SacreBLEU (Post, 2018). The fragmentation of this ecosystem leads to various problems, such as a wide range of incompatible conventions and APIs, or misreporting due to differing implementations and results.
In Evaluate, we provide a single interface backed by a centralized Hub. Metrics can easily be shared, are version-controlled, expose a standardized interface, and allow for multimodal inputs.
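As a minimal sketch of what that standardized interface looks like in practice, the snippet below loads a metric by name and computes a score; the choice of the accuracy metric and the toy predictions/references are illustrative only.

```python
# Minimal sketch of the shared metric interface in Evaluate.
import evaluate

# Load a metric implementation from the Hub by name.
accuracy = evaluate.load("accuracy")

# Every loaded metric exposes the same compute() entry point.
results = accuracy.compute(
    predictions=[0, 1, 1, 0],  # toy model outputs
    references=[0, 1, 0, 0],   # toy gold labels
)
print(results)  # {'accuracy': 0.75}
```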
Evaluation as a Service
The idea of Evaluation as a Service (Ma et al., 2021; Kiela et al., 2021), whereby models are submitted to another party to be evaluated centrally, has recently gained traction as a more reproducible way to conduct model evaluation. Central evaluation also facilitates holding challenges and competitions around datasets (Yadav et al., 2019; Pavao et al., 2022; Akhbardeh et al., 2021), as opposed to simply evaluating self-reported model results or comparing model scores with benchmark suites (Bajaj et al., 2016; Coleman et al., 2017; Wang et al., 2018, 2019; Kardas et al., 2020; Reddi et al., 2020; Liu et al., 2021; Goel et al., 2021; Dror et al., 2019). Conducting evaluation centrally has multiple advantages, including better reproducibility, forward/backward compatibility, and the ability to measure models along multiple axes of evaluation (e.g., efficiency and fairness, in addition to accuracy), which can contribute to a more systematic approach to evaluation.
Issues with Evaluation
Several studies of ML research and practice have been carried out in recent years on different aspects of ML evaluation, and together they paint a bleak picture of evaluation in our field. For instance, a 2019 large-scale replication study of 255 ML papers found that only 63% of the results they reported could be systematically replicated (Raff, 2019). A complementary survey of 3,800 papers from Papers with Code showed that a large majority of the metrics used do not adequately reflect models' performance and that they largely do not correlate with human judgement (Blagec et al., 2021). Finally, a recent study of 770 machine translation papers from the last decade found that while 108 new metrics have been proposed for the task, 99.8% of papers continue to use BLEU for reporting results (Marie et al., 2021), despite the fact that the
¹ github.com/google-research/google-research/tree/master/rouge