Evaluate & Evaluation on the Hub:
Better Best Practices for Data and Model Measurements
Leandro von Werra∗, Lewis Tunstall∗, Abhishek Thakur∗, Alexandra Sasha Luccioni∗,
Tristan Thrush, Aleksandra Piktus, Felix Marty, Nazneen Rajani,
Victor Mustar, Helen Ngo, Omar Sanseviero, Mario Šaško,
Albert Villanova, Quentin Lhoest, Julien Chaumond,
Margaret Mitchell, Alexander M. Rush, Thomas Wolf, Douwe Kiela
Hugging Face, Inc.
{leandro,lewis,abhishek,sasha.luccioni,douwe}@huggingface.co
Abstract

Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub, a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support reproducibility of evaluation, centralize and document the evaluation process, and broaden evaluation to cover more facets of model performance. It includes over 50 efficient canonical implementations for a variety of domains and scenarios, interactive documentation, and the ability to easily share implementations and outcomes. The library is available at https://github.com/huggingface/evaluate. In addition, we introduce Evaluation on the Hub, a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets on the Hugging Face Hub, for free, at the click of a button. Evaluation on the Hub is available at https://huggingface.co/autoevaluate.

Demo screencast: youtu.be/6rU177zRj8Q
1 Introduction
Evaluation is a crucial cornerstone of machine learning: not only can it help us gauge whether and how much progress we are making as a field, it can also help determine which model is most suitable for deployment in a given use case. However, while the progress made in terms of hardware and algorithms might look incredible to an ML practitioner from several decades ago, the way we evaluate models has changed very little. In fact, there is an emerging consensus that in order to meaningfully track progress in our field, we need to address serious issues in the way in which we evaluate ML systems (Kiela et al., 2021; Bowman and Dahl, 2021; Raji et al., 2021; Hutchinson et al., 2022).
∗Equal contribution.
Figure 1: Average number of evaluation datasets and metrics per paper, based on 10 random samples per year from EMNLP proceedings over the past two decades. More recent papers use more datasets and metrics, while fewer of them report statistical significance test results.
To get a clearer idea of how model evaluation has evolved in our field, we carried out our own analysis on a random sample of EMNLP papers from the past two decades and present the results in Figure 1. The number of evaluation datasets and metrics per paper has increased over time, suggesting that model evaluation is becoming increasingly complex and heterogeneous. However, auxiliary techniques such as testing for significance, measuring statistical power, and using appropriate sampling methods have become less common, making results harder to judge when comparing new results to previous work. We believe that while datasets are now more easily accessible thanks to shared repositories (Lhoest et al., 2021), model evaluation is still unnecessarily cumbersome, with a fragmented ecosystem and a lack of consensus around evaluation approaches and best practices.
The goal of this work is to address three practical challenges in model evaluation for ML: reproducibility, centralization, and coverage.
Reproducibility: ML systems are extremely sensitive to small (and often undocumented) choices such as random seeds and hyperparameters (Pineau et al., 2021). Model performance is often not compared with proper statistical testing that takes this variance into account, making many self-reported comparisons unreliable (see the sketch at the end of this section for one simple form such a test can take). Our goal is to standardize this process and thereby improve the reproducibility of ML evaluations.
Centralization: Historically, ML metrics have been poorly documented, exacerbating an already insufficient community-wide understanding of their usage and shortcomings (Post, 2018). As metrics and datasets change, the onus is on the community to keep results up to date, causing unnecessary replication of work (Ma et al., 2021) and the proliferation of outdated artifacts (Luccioni et al., 2022).
Coverage: ML as a field still focuses heavily on accuracy-based metrics. While important, this focus glosses over other critical facets such as efficiency (Min et al., 2021), bias and fairness (Qian et al., 2022), robustness (Goel et al., 2021), and how these factor into choosing a model (Ethayarajh and Jurafsky, 2020; Ma et al., 2021).
We introduce the open source Evaluate library and the Evaluation on the Hub platform to address many of these problems. We believe that better evaluation can happen if we, as a community, establish better best practices and remove hurdles.
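To make the statistical testing mentioned under Reproducibility concrete, here is a minimal sketch of one common approach, a paired bootstrap test over per-example scores from two models. It is only an illustration of the general idea, not a procedure taken from this paper or prescribed by Evaluate; the function name, toy score arrays, and resample count are assumptions made for the example.

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of bootstrap resamples in which model A outscores model B.

    scores_a and scores_b are per-example scores (e.g. 0/1 correctness)
    on the same test set, so each resample compares both models on the
    same subset of examples.
    """
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample test examples with replacement
        if scores_a[idx].mean() > scores_b[idx].mean():
            wins += 1
    return wins / n_resamples

# Toy per-example correctness for two hypothetical models on 500 examples.
rng = np.random.default_rng(42)
model_a = (rng.random(500) < 0.80).astype(int)  # roughly 80% accurate
model_b = (rng.random(500) < 0.78).astype(int)  # roughly 78% accurate
print(f"P(A beats B) under the bootstrap: {paired_bootstrap(model_a, model_b):.3f}")
```

A score near 0.5 suggests the observed gap between the two models could easily be an artifact of the particular test sample rather than a real difference.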
2 Related work
Open-Source Tools for Evaluation
There is a long history of open-source projects aiming to capture various measurements, metrics, and statistical testing methods for ML. Torchmetrics (Detlefsen et al., 2022) implements a large number of model evaluation metrics for PyTorch (Paszke et al., 2019), similar to the evaluation metrics found in Keras (Chollet et al., 2015) for TensorFlow. Libraries like Scikit-learn (Pedregosa et al., 2011), SciPy (Virtanen et al., 2020), Statsmodels (Seabold and Perktold, 2010), NLTK (Bird et al., 2009), TrecTools (Palotti et al., 2019), RL Reliability Metrics (Chan et al., 2020), NetworkX (Hagberg et al., 2008), Scikit-image (Van der Walt et al., 2014), GEM (Gehrmann et al., 2021), and TorchFidelity (Obukhov et al., 2020) also support many evaluation measures across many domains. As integrating metrics into specific frameworks can be difficult, there are also many libraries dedicated to individual evaluations, for example rouge_score,¹ BARTScore (Yuan et al., 2021), or SacreBLEU (Post, 2018). The fragmentation of the ecosystem leads to various problems, such as a wide range of incompatible conventions and APIs, or misreporting due to differing implementations and results.
In Evaluate, we provide one single interface backed by a centralized Hub. Metrics can easily be shared, are version controlled, have a standardized interface, and allow for multimodal inputs.
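As a rough sketch of what that single interface looks like in practice (the toy predictions and references below are invented for illustration, and the exact keys of the returned dictionaries may differ across library versions):

```python
import evaluate

# Load one metric from the Hub by name; the implementation ships with a
# metric card documenting its usage and limitations.
accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# e.g. {'accuracy': 0.75}

# Several modules can be grouped behind the same interface and computed
# in a single call over the same toy predictions and references.
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
print(clf_metrics.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
```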
Evaluation as a Service
The idea of Evaluation as a Service (Ma et al., 2021; Kiela et al., 2021), whereby models are submitted to another party to be centrally evaluated, has recently gained traction as a more reproducible way to conduct model evaluation. Central evaluation also facilitates holding challenges and competitions around datasets (Yadav et al., 2019; Pavao et al., 2022; Akhbardeh et al., 2021), as opposed to simply evaluating self-reported model results or comparing model scores with benchmark suites (Bajaj et al., 2016; Coleman et al., 2017; Wang et al., 2018, 2019; Kardas et al., 2020; Reddi et al., 2020; Liu et al., 2021; Goel et al., 2021; Dror et al., 2019). Conducting evaluation centrally has multiple advantages, including better reproducibility, forward/backward compatibility, and the ability to measure models along multiple axes of evaluation (e.g. efficiency and fairness, in addition to accuracy), which can help contribute towards a more systematic approach to evaluation.
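To give a flavour of what push-button evaluation looks like from the client side, the sketch below uses the evaluator abstraction that ships with the Evaluate library to score a Hub model on a slice of a Hub dataset. The model name, dataset slice, and label mapping follow the library's documentation examples as I recall them rather than anything stated in this paper, and argument names may vary between library versions.

```python
import evaluate
from datasets import load_dataset

# A small slice of a Hub dataset keeps the example quick to run.
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(200))

# Task-specific evaluator: wires together model inference, the dataset
# columns, and the chosen metric behind one compute() call.
task_evaluator = evaluate.evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",   # example Hub checkpoint
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # map pipeline labels to dataset ids
)
print(results)
```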
Issues with Evaluation
Several studies of ML research and practice have been carried out in recent years on different aspects of ML evaluation, and together they paint a bleak picture of evaluation in our field. For instance, a 2019 large-scale replication study of 255 ML papers found that only 63% of the results they reported could be systematically replicated (Raff, 2019). A complementary survey of 3,800 papers from Papers with Code showed that a large majority of the metrics used do not adequately reflect models' performance and largely do not correlate with human judgement (Blagec et al., 2021). Finally, a recent study of 770 machine translation papers from the last decade found that while 108 new metrics have been proposed for the task, 99.8% of papers continue to use BLEU for reporting results (Marie et al., 2021), despite the fact that the
¹ github.com/google-research/google-research/tree/master/rouge